mirror of
https://mirrors.bfsu.edu.cn/git/linux.git
synced 2025-01-10 15:54:39 +08:00
9845f7409e
Remove the repeated word "the" in comments. Signed-off-by: Charles Han <hanchunchao@inspur.com> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed: Dave Chinner <dchinner@redhat.com> Signed-off-by: "Matthew Wilcox (Oracle)" <willy@infradead.org> Signed-off-by: Jonathan Corbet <corbet@lwn.net>
5316 lines
241 KiB
ReStructuredText
5316 lines
241 KiB
ReStructuredText
.. SPDX-License-Identifier: GPL-2.0
|
||
.. _xfs_online_fsck_design:
|
||
|
||
..
|
||
Mapping of heading styles within this document:
|
||
Heading 1 uses "====" above and below
|
||
Heading 2 uses "===="
|
||
Heading 3 uses "----"
|
||
Heading 4 uses "````"
|
||
Heading 5 uses "^^^^"
|
||
Heading 6 uses "~~~~"
|
||
Heading 7 uses "...."
|
||
|
||
Sections are manually numbered because apparently that's what everyone
|
||
does in the kernel.
|
||
|
||
======================
|
||
XFS Online Fsck Design
|
||
======================
|
||
|
||
This document captures the design of the online filesystem check feature for
|
||
XFS.
|
||
The purpose of this document is threefold:
|
||
|
||
- To help kernel distributors understand exactly what the XFS online fsck
|
||
feature is, and issues about which they should be aware.
|
||
|
||
- To help people reading the code to familiarize themselves with the relevant
|
||
concepts and design points before they start digging into the code.
|
||
|
||
- To help developers maintaining the system by capturing the reasons
|
||
supporting higher level decision making.
|
||
|
||
As the online fsck code is merged, the links in this document to topic branches
|
||
will be replaced with links to code.
|
||
|
||
This document is licensed under the terms of the GNU Public License, v2.
|
||
The primary author is Darrick J. Wong.
|
||
|
||
This design document is split into seven parts.
|
||
Part 1 defines what fsck tools are and the motivations for writing a new one.
|
||
Parts 2 and 3 present a high level overview of how online fsck process works
|
||
and how it is tested to ensure correct functionality.
|
||
Part 4 discusses the user interface and the intended usage modes of the new
|
||
program.
|
||
Parts 5 and 6 show off the high level components and how they fit together, and
|
||
then present case studies of how each repair function actually works.
|
||
Part 7 sums up what has been discussed so far and speculates about what else
|
||
might be built atop online fsck.
|
||
|
||
.. contents:: Table of Contents
|
||
:local:
|
||
|
||
1. What is a Filesystem Check?
|
||
==============================
|
||
|
||
A Unix filesystem has four main responsibilities:
|
||
|
||
- Provide a hierarchy of names through which application programs can associate
|
||
arbitrary blobs of data for any length of time,
|
||
|
||
- Virtualize physical storage media across those names, and
|
||
|
||
- Retrieve the named data blobs at any time.
|
||
|
||
- Examine resource usage.
|
||
|
||
Metadata directly supporting these functions (e.g. files, directories, space
|
||
mappings) are sometimes called primary metadata.
|
||
Secondary metadata (e.g. reverse mapping and directory parent pointers) support
|
||
operations internal to the filesystem, such as internal consistency checking
|
||
and reorganization.
|
||
Summary metadata, as the name implies, condense information contained in
|
||
primary metadata for performance reasons.
|
||
|
||
The filesystem check (fsck) tool examines all the metadata in a filesystem
|
||
to look for errors.
|
||
In addition to looking for obvious metadata corruptions, fsck also
|
||
cross-references different types of metadata records with each other to look
|
||
for inconsistencies.
|
||
People do not like losing data, so most fsck tools also contains some ability
|
||
to correct any problems found.
|
||
As a word of caution -- the primary goal of most Linux fsck tools is to restore
|
||
the filesystem metadata to a consistent state, not to maximize the data
|
||
recovered.
|
||
That precedent will not be challenged here.
|
||
|
||
Filesystems of the 20th century generally lacked any redundancy in the ondisk
|
||
format, which means that fsck can only respond to errors by erasing files until
|
||
errors are no longer detected.
|
||
More recent filesystem designs contain enough redundancy in their metadata that
|
||
it is now possible to regenerate data structures when non-catastrophic errors
|
||
occur; this capability aids both strategies.
|
||
|
||
+--------------------------------------------------------------------------+
|
||
| **Note**: |
|
||
+--------------------------------------------------------------------------+
|
||
| System administrators avoid data loss by increasing the number of |
|
||
| separate storage systems through the creation of backups; and they avoid |
|
||
| downtime by increasing the redundancy of each storage system through the |
|
||
| creation of RAID arrays. |
|
||
| fsck tools address only the first problem. |
|
||
+--------------------------------------------------------------------------+
|
||
|
||
TLDR; Show Me the Code!
|
||
-----------------------
|
||
|
||
Code is posted to the kernel.org git trees as follows:
|
||
`kernel changes <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-symlink>`_,
|
||
`userspace changes <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-media-scan-service>`_, and
|
||
`QA test changes <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=repair-dirs>`_.
|
||
Each kernel patchset adding an online repair function will use the same branch
|
||
name across the kernel, xfsprogs, and fstests git repos.
|
||
|
||
Existing Tools
|
||
--------------
|
||
|
||
The online fsck tool described here will be the third tool in the history of
|
||
XFS (on Linux) to check and repair filesystems.
|
||
Two programs precede it:
|
||
|
||
The first program, ``xfs_check``, was created as part of the XFS debugger
|
||
(``xfs_db``) and can only be used with unmounted filesystems.
|
||
It walks all metadata in the filesystem looking for inconsistencies in the
|
||
metadata, though it lacks any ability to repair what it finds.
|
||
Due to its high memory requirements and inability to repair things, this
|
||
program is now deprecated and will not be discussed further.
|
||
|
||
The second program, ``xfs_repair``, was created to be faster and more robust
|
||
than the first program.
|
||
Like its predecessor, it can only be used with unmounted filesystems.
|
||
It uses extent-based in-memory data structures to reduce memory consumption,
|
||
and tries to schedule readahead IO appropriately to reduce I/O waiting time
|
||
while it scans the metadata of the entire filesystem.
|
||
The most important feature of this tool is its ability to respond to
|
||
inconsistencies in file metadata and directory tree by erasing things as needed
|
||
to eliminate problems.
|
||
Space usage metadata are rebuilt from the observed file metadata.
|
||
|
||
Problem Statement
|
||
-----------------
|
||
|
||
The current XFS tools leave several problems unsolved:
|
||
|
||
1. **User programs** suddenly **lose access** to the filesystem when unexpected
|
||
shutdowns occur as a result of silent corruptions in the metadata.
|
||
These occur **unpredictably** and often without warning.
|
||
|
||
2. **Users** experience a **total loss of service** during the recovery period
|
||
after an **unexpected shutdown** occurs.
|
||
|
||
3. **Users** experience a **total loss of service** if the filesystem is taken
|
||
offline to **look for problems** proactively.
|
||
|
||
4. **Data owners** cannot **check the integrity** of their stored data without
|
||
reading all of it.
|
||
This may expose them to substantial billing costs when a linear media scan
|
||
performed by the storage system administrator might suffice.
|
||
|
||
5. **System administrators** cannot **schedule** a maintenance window to deal
|
||
with corruptions if they **lack the means** to assess filesystem health
|
||
while the filesystem is online.
|
||
|
||
6. **Fleet monitoring tools** cannot **automate periodic checks** of filesystem
|
||
health when doing so requires **manual intervention** and downtime.
|
||
|
||
7. **Users** can be tricked into **doing things they do not desire** when
|
||
malicious actors **exploit quirks of Unicode** to place misleading names
|
||
in directories.
|
||
|
||
Given this definition of the problems to be solved and the actors who would
|
||
benefit, the proposed solution is a third fsck tool that acts on a running
|
||
filesystem.
|
||
|
||
This new third program has three components: an in-kernel facility to check
|
||
metadata, an in-kernel facility to repair metadata, and a userspace driver
|
||
program to drive fsck activity on a live filesystem.
|
||
``xfs_scrub`` is the name of the driver program.
|
||
The rest of this document presents the goals and use cases of the new fsck
|
||
tool, describes its major design points in connection to those goals, and
|
||
discusses the similarities and differences with existing tools.
|
||
|
||
+--------------------------------------------------------------------------+
|
||
| **Note**: |
|
||
+--------------------------------------------------------------------------+
|
||
| Throughout this document, the existing offline fsck tool can also be |
|
||
| referred to by its current name "``xfs_repair``". |
|
||
| The userspace driver program for the new online fsck tool can be |
|
||
| referred to as "``xfs_scrub``". |
|
||
| The kernel portion of online fsck that validates metadata is called |
|
||
| "online scrub", and portion of the kernel that fixes metadata is called |
|
||
| "online repair". |
|
||
+--------------------------------------------------------------------------+
|
||
|
||
The naming hierarchy is broken up into objects known as directories and files
|
||
and the physical space is split into pieces known as allocation groups.
|
||
Sharding enables better performance on highly parallel systems and helps to
|
||
contain the damage when corruptions occur.
|
||
The division of the filesystem into principal objects (allocation groups and
|
||
inodes) means that there are ample opportunities to perform targeted checks and
|
||
repairs on a subset of the filesystem.
|
||
|
||
While this is going on, other parts continue processing IO requests.
|
||
Even if a piece of filesystem metadata can only be regenerated by scanning the
|
||
entire system, the scan can still be done in the background while other file
|
||
operations continue.
|
||
|
||
In summary, online fsck takes advantage of resource sharding and redundant
|
||
metadata to enable targeted checking and repair operations while the system
|
||
is running.
|
||
This capability will be coupled to automatic system management so that
|
||
autonomous self-healing of XFS maximizes service availability.
|
||
|
||
2. Theory of Operation
|
||
======================
|
||
|
||
Because it is necessary for online fsck to lock and scan live metadata objects,
|
||
online fsck consists of three separate code components.
|
||
The first is the userspace driver program ``xfs_scrub``, which is responsible
|
||
for identifying individual metadata items, scheduling work items for them,
|
||
reacting to the outcomes appropriately, and reporting results to the system
|
||
administrator.
|
||
The second and third are in the kernel, which implements functions to check
|
||
and repair each type of online fsck work item.
|
||
|
||
+------------------------------------------------------------------+
|
||
| **Note**: |
|
||
+------------------------------------------------------------------+
|
||
| For brevity, this document shortens the phrase "online fsck work |
|
||
| item" to "scrub item". |
|
||
+------------------------------------------------------------------+
|
||
|
||
Scrub item types are delineated in a manner consistent with the Unix design
|
||
philosophy, which is to say that each item should handle one aspect of a
|
||
metadata structure, and handle it well.
|
||
|
||
Scope
|
||
-----
|
||
|
||
In principle, online fsck should be able to check and to repair everything that
|
||
the offline fsck program can handle.
|
||
However, online fsck cannot be running 100% of the time, which means that
|
||
latent errors may creep in after a scrub completes.
|
||
If these errors cause the next mount to fail, offline fsck is the only
|
||
solution.
|
||
This limitation means that maintenance of the offline fsck tool will continue.
|
||
A second limitation of online fsck is that it must follow the same resource
|
||
sharing and lock acquisition rules as the regular filesystem.
|
||
This means that scrub cannot take *any* shortcuts to save time, because doing
|
||
so could lead to concurrency problems.
|
||
In other words, online fsck is not a complete replacement for offline fsck, and
|
||
a complete run of online fsck may take longer than online fsck.
|
||
However, both of these limitations are acceptable tradeoffs to satisfy the
|
||
different motivations of online fsck, which are to **minimize system downtime**
|
||
and to **increase predictability of operation**.
|
||
|
||
.. _scrubphases:
|
||
|
||
Phases of Work
|
||
--------------
|
||
|
||
The userspace driver program ``xfs_scrub`` splits the work of checking and
|
||
repairing an entire filesystem into seven phases.
|
||
Each phase concentrates on checking specific types of scrub items and depends
|
||
on the success of all previous phases.
|
||
The seven phases are as follows:
|
||
|
||
1. Collect geometry information about the mounted filesystem and computer,
|
||
discover the online fsck capabilities of the kernel, and open the
|
||
underlying storage devices.
|
||
|
||
2. Check allocation group metadata, all realtime volume metadata, and all quota
|
||
files.
|
||
Each metadata structure is scheduled as a separate scrub item.
|
||
If corruption is found in the inode header or inode btree and ``xfs_scrub``
|
||
is permitted to perform repairs, then those scrub items are repaired to
|
||
prepare for phase 3.
|
||
Repairs are implemented by using the information in the scrub item to
|
||
resubmit the kernel scrub call with the repair flag enabled; this is
|
||
discussed in the next section.
|
||
Optimizations and all other repairs are deferred to phase 4.
|
||
|
||
3. Check all metadata of every file in the filesystem.
|
||
Each metadata structure is also scheduled as a separate scrub item.
|
||
If repairs are needed and ``xfs_scrub`` is permitted to perform repairs,
|
||
and there were no problems detected during phase 2, then those scrub items
|
||
are repaired immediately.
|
||
Optimizations, deferred repairs, and unsuccessful repairs are deferred to
|
||
phase 4.
|
||
|
||
4. All remaining repairs and scheduled optimizations are performed during this
|
||
phase, if the caller permits them.
|
||
Before starting repairs, the summary counters are checked and any necessary
|
||
repairs are performed so that subsequent repairs will not fail the resource
|
||
reservation step due to wildly incorrect summary counters.
|
||
Unsuccessful repairs are requeued as long as forward progress on repairs is
|
||
made somewhere in the filesystem.
|
||
Free space in the filesystem is trimmed at the end of phase 4 if the
|
||
filesystem is clean.
|
||
|
||
5. By the start of this phase, all primary and secondary filesystem metadata
|
||
must be correct.
|
||
Summary counters such as the free space counts and quota resource counts
|
||
are checked and corrected.
|
||
Directory entry names and extended attribute names are checked for
|
||
suspicious entries such as control characters or confusing Unicode sequences
|
||
appearing in names.
|
||
|
||
6. If the caller asks for a media scan, read all allocated and written data
|
||
file extents in the filesystem.
|
||
The ability to use hardware-assisted data file integrity checking is new
|
||
to online fsck; neither of the previous tools have this capability.
|
||
If media errors occur, they will be mapped to the owning files and reported.
|
||
|
||
7. Re-check the summary counters and presents the caller with a summary of
|
||
space usage and file counts.
|
||
|
||
This allocation of responsibilities will be :ref:`revisited <scrubcheck>`
|
||
later in this document.
|
||
|
||
Steps for Each Scrub Item
|
||
-------------------------
|
||
|
||
The kernel scrub code uses a three-step strategy for checking and repairing
|
||
the one aspect of a metadata object represented by a scrub item:
|
||
|
||
1. The scrub item of interest is checked for corruptions; opportunities for
|
||
optimization; and for values that are directly controlled by the system
|
||
administrator but look suspicious.
|
||
If the item is not corrupt or does not need optimization, resource are
|
||
released and the positive scan results are returned to userspace.
|
||
If the item is corrupt or could be optimized but the caller does not permit
|
||
this, resources are released and the negative scan results are returned to
|
||
userspace.
|
||
Otherwise, the kernel moves on to the second step.
|
||
|
||
2. The repair function is called to rebuild the data structure.
|
||
Repair functions generally choose rebuild a structure from other metadata
|
||
rather than try to salvage the existing structure.
|
||
If the repair fails, the scan results from the first step are returned to
|
||
userspace.
|
||
Otherwise, the kernel moves on to the third step.
|
||
|
||
3. In the third step, the kernel runs the same checks over the new metadata
|
||
item to assess the efficacy of the repairs.
|
||
The results of the reassessment are returned to userspace.
|
||
|
||
Classification of Metadata
|
||
--------------------------
|
||
|
||
Each type of metadata object (and therefore each type of scrub item) is
|
||
classified as follows:
|
||
|
||
Primary Metadata
|
||
````````````````
|
||
|
||
Metadata structures in this category should be most familiar to filesystem
|
||
users either because they are directly created by the user or they index
|
||
objects created by the user
|
||
Most filesystem objects fall into this class:
|
||
|
||
- Free space and reference count information
|
||
|
||
- Inode records and indexes
|
||
|
||
- Storage mapping information for file data
|
||
|
||
- Directories
|
||
|
||
- Extended attributes
|
||
|
||
- Symbolic links
|
||
|
||
- Quota limits
|
||
|
||
Scrub obeys the same rules as regular filesystem accesses for resource and lock
|
||
acquisition.
|
||
|
||
Primary metadata objects are the simplest for scrub to process.
|
||
The principal filesystem object (either an allocation group or an inode) that
|
||
owns the item being scrubbed is locked to guard against concurrent updates.
|
||
The check function examines every record associated with the type for obvious
|
||
errors and cross-references healthy records against other metadata to look for
|
||
inconsistencies.
|
||
Repairs for this class of scrub item are simple, since the repair function
|
||
starts by holding all the resources acquired in the previous step.
|
||
The repair function scans available metadata as needed to record all the
|
||
observations needed to complete the structure.
|
||
Next, it stages the observations in a new ondisk structure and commits it
|
||
atomically to complete the repair.
|
||
Finally, the storage from the old data structure are carefully reaped.
|
||
|
||
Because ``xfs_scrub`` locks a primary object for the duration of the repair,
|
||
this is effectively an offline repair operation performed on a subset of the
|
||
filesystem.
|
||
This minimizes the complexity of the repair code because it is not necessary to
|
||
handle concurrent updates from other threads, nor is it necessary to access
|
||
any other part of the filesystem.
|
||
As a result, indexed structures can be rebuilt very quickly, and programs
|
||
trying to access the damaged structure will be blocked until repairs complete.
|
||
The only infrastructure needed by the repair code are the staging area for
|
||
observations and a means to write new structures to disk.
|
||
Despite these limitations, the advantage that online repair holds is clear:
|
||
targeted work on individual shards of the filesystem avoids total loss of
|
||
service.
|
||
|
||
This mechanism is described in section 2.1 ("Off-Line Algorithm") of
|
||
V. Srinivasan and M. J. Carey, `"Performance of On-Line Index Construction
|
||
Algorithms" <https://minds.wisconsin.edu/bitstream/handle/1793/59524/TR1047.pdf>`_,
|
||
*Extending Database Technology*, pp. 293-309, 1992.
|
||
|
||
Most primary metadata repair functions stage their intermediate results in an
|
||
in-memory array prior to formatting the new ondisk structure, which is very
|
||
similar to the list-based algorithm discussed in section 2.3 ("List-Based
|
||
Algorithms") of Srinivasan.
|
||
However, any data structure builder that maintains a resource lock for the
|
||
duration of the repair is *always* an offline algorithm.
|
||
|
||
.. _secondary_metadata:
|
||
|
||
Secondary Metadata
|
||
``````````````````
|
||
|
||
Metadata structures in this category reflect records found in primary metadata,
|
||
but are only needed for online fsck or for reorganization of the filesystem.
|
||
|
||
Secondary metadata include:
|
||
|
||
- Reverse mapping information
|
||
|
||
- Directory parent pointers
|
||
|
||
This class of metadata is difficult for scrub to process because scrub attaches
|
||
to the secondary object but needs to check primary metadata, which runs counter
|
||
to the usual order of resource acquisition.
|
||
Frequently, this means that full filesystems scans are necessary to rebuild the
|
||
metadata.
|
||
Check functions can be limited in scope to reduce runtime.
|
||
Repairs, however, require a full scan of primary metadata, which can take a
|
||
long time to complete.
|
||
Under these conditions, ``xfs_scrub`` cannot lock resources for the entire
|
||
duration of the repair.
|
||
|
||
Instead, repair functions set up an in-memory staging structure to store
|
||
observations.
|
||
Depending on the requirements of the specific repair function, the staging
|
||
index will either have the same format as the ondisk structure or a design
|
||
specific to that repair function.
|
||
The next step is to release all locks and start the filesystem scan.
|
||
When the repair scanner needs to record an observation, the staging data are
|
||
locked long enough to apply the update.
|
||
While the filesystem scan is in progress, the repair function hooks the
|
||
filesystem so that it can apply pending filesystem updates to the staging
|
||
information.
|
||
Once the scan is done, the owning object is re-locked, the live data is used to
|
||
write a new ondisk structure, and the repairs are committed atomically.
|
||
The hooks are disabled and the staging staging area is freed.
|
||
Finally, the storage from the old data structure are carefully reaped.
|
||
|
||
Introducing concurrency helps online repair avoid various locking problems, but
|
||
comes at a high cost to code complexity.
|
||
Live filesystem code has to be hooked so that the repair function can observe
|
||
updates in progress.
|
||
The staging area has to become a fully functional parallel structure so that
|
||
updates can be merged from the hooks.
|
||
Finally, the hook, the filesystem scan, and the inode locking model must be
|
||
sufficiently well integrated that a hook event can decide if a given update
|
||
should be applied to the staging structure.
|
||
|
||
In theory, the scrub implementation could apply these same techniques for
|
||
primary metadata, but doing so would make it massively more complex and less
|
||
performant.
|
||
Programs attempting to access the damaged structures are not blocked from
|
||
operation, which may cause application failure or an unplanned filesystem
|
||
shutdown.
|
||
|
||
Inspiration for the secondary metadata repair strategy was drawn from section
|
||
2.4 of Srinivasan above, and sections 2 ("NSF: Inded Build Without Side-File")
|
||
and 3.1.1 ("Duplicate Key Insert Problem") in C. Mohan, `"Algorithms for
|
||
Creating Indexes for Very Large Tables Without Quiescing Updates"
|
||
<https://dl.acm.org/doi/10.1145/130283.130337>`_, 1992.
|
||
|
||
The sidecar index mentioned above bears some resemblance to the side file
|
||
method mentioned in Srinivasan and Mohan.
|
||
Their method consists of an index builder that extracts relevant record data to
|
||
build the new structure as quickly as possible; and an auxiliary structure that
|
||
captures all updates that would be committed to the index by other threads were
|
||
the new index already online.
|
||
After the index building scan finishes, the updates recorded in the side file
|
||
are applied to the new index.
|
||
To avoid conflicts between the index builder and other writer threads, the
|
||
builder maintains a publicly visible cursor that tracks the progress of the
|
||
scan through the record space.
|
||
To avoid duplication of work between the side file and the index builder, side
|
||
file updates are elided when the record ID for the update is greater than the
|
||
cursor position within the record ID space.
|
||
|
||
To minimize changes to the rest of the codebase, XFS online repair keeps the
|
||
replacement index hidden until it's completely ready to go.
|
||
In other words, there is no attempt to expose the keyspace of the new index
|
||
while repair is running.
|
||
The complexity of such an approach would be very high and perhaps more
|
||
appropriate to building *new* indices.
|
||
|
||
**Future Work Question**: Can the full scan and live update code used to
|
||
facilitate a repair also be used to implement a comprehensive check?
|
||
|
||
*Answer*: In theory, yes. Check would be much stronger if each scrub function
|
||
employed these live scans to build a shadow copy of the metadata and then
|
||
compared the shadow records to the ondisk records.
|
||
However, doing that is a fair amount more work than what the checking functions
|
||
do now.
|
||
The live scans and hooks were developed much later.
|
||
That in turn increases the runtime of those scrub functions.
|
||
|
||
Summary Information
|
||
```````````````````
|
||
|
||
Metadata structures in this last category summarize the contents of primary
|
||
metadata records.
|
||
These are often used to speed up resource usage queries, and are many times
|
||
smaller than the primary metadata which they represent.
|
||
|
||
Examples of summary information include:
|
||
|
||
- Summary counts of free space and inodes
|
||
|
||
- File link counts from directories
|
||
|
||
- Quota resource usage counts
|
||
|
||
Check and repair require full filesystem scans, but resource and lock
|
||
acquisition follow the same paths as regular filesystem accesses.
|
||
|
||
The superblock summary counters have special requirements due to the underlying
|
||
implementation of the incore counters, and will be treated separately.
|
||
Check and repair of the other types of summary counters (quota resource counts
|
||
and file link counts) employ the same filesystem scanning and hooking
|
||
techniques as outlined above, but because the underlying data are sets of
|
||
integer counters, the staging data need not be a fully functional mirror of the
|
||
ondisk structure.
|
||
|
||
Inspiration for quota and file link count repair strategies were drawn from
|
||
sections 2.12 ("Online Index Operations") through 2.14 ("Incremental View
|
||
Maintenance") of G. Graefe, `"Concurrent Queries and Updates in Summary Views
|
||
and Their Indexes"
|
||
<http://www.odbms.org/wp-content/uploads/2014/06/Increment-locks.pdf>`_, 2011.
|
||
|
||
Since quotas are non-negative integer counts of resource usage, online
|
||
quotacheck can use the incremental view deltas described in section 2.14 to
|
||
track pending changes to the block and inode usage counts in each transaction,
|
||
and commit those changes to a dquot side file when the transaction commits.
|
||
Delta tracking is necessary for dquots because the index builder scans inodes,
|
||
whereas the data structure being rebuilt is an index of dquots.
|
||
Link count checking combines the view deltas and commit step into one because
|
||
it sets attributes of the objects being scanned instead of writing them to a
|
||
separate data structure.
|
||
Each online fsck function will be discussed as case studies later in this
|
||
document.
|
||
|
||
Risk Management
|
||
---------------
|
||
|
||
During the development of online fsck, several risk factors were identified
|
||
that may make the feature unsuitable for certain distributors and users.
|
||
Steps can be taken to mitigate or eliminate those risks, though at a cost to
|
||
functionality.
|
||
|
||
- **Decreased performance**: Adding metadata indices to the filesystem
|
||
increases the time cost of persisting changes to disk, and the reverse space
|
||
mapping and directory parent pointers are no exception.
|
||
System administrators who require the maximum performance can disable the
|
||
reverse mapping features at format time, though this choice dramatically
|
||
reduces the ability of online fsck to find inconsistencies and repair them.
|
||
|
||
- **Incorrect repairs**: As with all software, there might be defects in the
|
||
software that result in incorrect repairs being written to the filesystem.
|
||
Systematic fuzz testing (detailed in the next section) is employed by the
|
||
authors to find bugs early, but it might not catch everything.
|
||
The kernel build system provides Kconfig options (``CONFIG_XFS_ONLINE_SCRUB``
|
||
and ``CONFIG_XFS_ONLINE_REPAIR``) to enable distributors to choose not to
|
||
accept this risk.
|
||
The xfsprogs build system has a configure option (``--enable-scrub=no``) that
|
||
disables building of the ``xfs_scrub`` binary, though this is not a risk
|
||
mitigation if the kernel functionality remains enabled.
|
||
|
||
- **Inability to repair**: Sometimes, a filesystem is too badly damaged to be
|
||
repairable.
|
||
If the keyspaces of several metadata indices overlap in some manner but a
|
||
coherent narrative cannot be formed from records collected, then the repair
|
||
fails.
|
||
To reduce the chance that a repair will fail with a dirty transaction and
|
||
render the filesystem unusable, the online repair functions have been
|
||
designed to stage and validate all new records before committing the new
|
||
structure.
|
||
|
||
- **Misbehavior**: Online fsck requires many privileges -- raw IO to block
|
||
devices, opening files by handle, ignoring Unix discretionary access control,
|
||
and the ability to perform administrative changes.
|
||
Running this automatically in the background scares people, so the systemd
|
||
background service is configured to run with only the privileges required.
|
||
Obviously, this cannot address certain problems like the kernel crashing or
|
||
deadlocking, but it should be sufficient to prevent the scrub process from
|
||
escaping and reconfiguring the system.
|
||
The cron job does not have this protection.
|
||
|
||
- **Fuzz Kiddiez**: There are many people now who seem to think that running
|
||
automated fuzz testing of ondisk artifacts to find mischievous behavior and
|
||
spraying exploit code onto the public mailing list for instant zero-day
|
||
disclosure is somehow of some social benefit.
|
||
In the view of this author, the benefit is realized only when the fuzz
|
||
operators help to **fix** the flaws, but this opinion apparently is not
|
||
widely shared among security "researchers".
|
||
The XFS maintainers' continuing ability to manage these events presents an
|
||
ongoing risk to the stability of the development process.
|
||
Automated testing should front-load some of the risk while the feature is
|
||
considered EXPERIMENTAL.
|
||
|
||
Many of these risks are inherent to software programming.
|
||
Despite this, it is hoped that this new functionality will prove useful in
|
||
reducing unexpected downtime.
|
||
|
||
3. Testing Plan
|
||
===============
|
||
|
||
As stated before, fsck tools have three main goals:
|
||
|
||
1. Detect inconsistencies in the metadata;
|
||
|
||
2. Eliminate those inconsistencies; and
|
||
|
||
3. Minimize further loss of data.
|
||
|
||
Demonstrations of correct operation are necessary to build users' confidence
|
||
that the software behaves within expectations.
|
||
Unfortunately, it was not really feasible to perform regular exhaustive testing
|
||
of every aspect of a fsck tool until the introduction of low-cost virtual
|
||
machines with high-IOPS storage.
|
||
With ample hardware availability in mind, the testing strategy for the online
|
||
fsck project involves differential analysis against the existing fsck tools and
|
||
systematic testing of every attribute of every type of metadata object.
|
||
Testing can be split into four major categories, as discussed below.
|
||
|
||
Integrated Testing with fstests
|
||
-------------------------------
|
||
|
||
The primary goal of any free software QA effort is to make testing as
|
||
inexpensive and widespread as possible to maximize the scaling advantages of
|
||
community.
|
||
In other words, testing should maximize the breadth of filesystem configuration
|
||
scenarios and hardware setups.
|
||
This improves code quality by enabling the authors of online fsck to find and
|
||
fix bugs early, and helps developers of new features to find integration
|
||
issues earlier in their development effort.
|
||
|
||
The Linux filesystem community shares a common QA testing suite,
|
||
`fstests <https://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git/>`_, for
|
||
functional and regression testing.
|
||
Even before development work began on online fsck, fstests (when run on XFS)
|
||
would run both the ``xfs_check`` and ``xfs_repair -n`` commands on the test and
|
||
scratch filesystems between each test.
|
||
This provides a level of assurance that the kernel and the fsck tools stay in
|
||
alignment about what constitutes consistent metadata.
|
||
During development of the online checking code, fstests was modified to run
|
||
``xfs_scrub -n`` between each test to ensure that the new checking code
|
||
produces the same results as the two existing fsck tools.
|
||
|
||
To start development of online repair, fstests was modified to run
|
||
``xfs_repair`` to rebuild the filesystem's metadata indices between tests.
|
||
This ensures that offline repair does not crash, leave a corrupt filesystem
|
||
after it exists, or trigger complaints from the online check.
|
||
This also established a baseline for what can and cannot be repaired offline.
|
||
To complete the first phase of development of online repair, fstests was
|
||
modified to be able to run ``xfs_scrub`` in a "force rebuild" mode.
|
||
This enables a comparison of the effectiveness of online repair as compared to
|
||
the existing offline repair tools.
|
||
|
||
General Fuzz Testing of Metadata Blocks
|
||
---------------------------------------
|
||
|
||
XFS benefits greatly from having a very robust debugging tool, ``xfs_db``.
|
||
|
||
Before development of online fsck even began, a set of fstests were created
|
||
to test the rather common fault that entire metadata blocks get corrupted.
|
||
This required the creation of fstests library code that can create a filesystem
|
||
containing every possible type of metadata object.
|
||
Next, individual test cases were created to create a test filesystem, identify
|
||
a single block of a specific type of metadata object, trash it with the
|
||
existing ``blocktrash`` command in ``xfs_db``, and test the reaction of a
|
||
particular metadata validation strategy.
|
||
|
||
This earlier test suite enabled XFS developers to test the ability of the
|
||
in-kernel validation functions and the ability of the offline fsck tool to
|
||
detect and eliminate the inconsistent metadata.
|
||
This part of the test suite was extended to cover online fsck in exactly the
|
||
same manner.
|
||
|
||
In other words, for a given fstests filesystem configuration:
|
||
|
||
* For each metadata object existing on the filesystem:
|
||
|
||
* Write garbage to it
|
||
|
||
* Test the reactions of:
|
||
|
||
1. The kernel verifiers to stop obviously bad metadata
|
||
2. Offline repair (``xfs_repair``) to detect and fix
|
||
3. Online repair (``xfs_scrub``) to detect and fix
|
||
|
||
Targeted Fuzz Testing of Metadata Records
|
||
-----------------------------------------
|
||
|
||
The testing plan for online fsck includes extending the existing fs testing
|
||
infrastructure to provide a much more powerful facility: targeted fuzz testing
|
||
of every metadata field of every metadata object in the filesystem.
|
||
``xfs_db`` can modify every field of every metadata structure in every
|
||
block in the filesystem to simulate the effects of memory corruption and
|
||
software bugs.
|
||
Given that fstests already contains the ability to create a filesystem
|
||
containing every metadata format known to the filesystem, ``xfs_db`` can be
|
||
used to perform exhaustive fuzz testing!
|
||
|
||
For a given fstests filesystem configuration:
|
||
|
||
* For each metadata object existing on the filesystem...
|
||
|
||
* For each record inside that metadata object...
|
||
|
||
* For each field inside that record...
|
||
|
||
* For each conceivable type of transformation that can be applied to a bit field...
|
||
|
||
1. Clear all bits
|
||
2. Set all bits
|
||
3. Toggle the most significant bit
|
||
4. Toggle the middle bit
|
||
5. Toggle the least significant bit
|
||
6. Add a small quantity
|
||
7. Subtract a small quantity
|
||
8. Randomize the contents
|
||
|
||
* ...test the reactions of:
|
||
|
||
1. The kernel verifiers to stop obviously bad metadata
|
||
2. Offline checking (``xfs_repair -n``)
|
||
3. Offline repair (``xfs_repair``)
|
||
4. Online checking (``xfs_scrub -n``)
|
||
5. Online repair (``xfs_scrub``)
|
||
6. Both repair tools (``xfs_scrub`` and then ``xfs_repair`` if online repair doesn't succeed)
|
||
|
||
This is quite the combinatoric explosion!
|
||
|
||
Fortunately, having this much test coverage makes it easy for XFS developers to
|
||
check the responses of XFS' fsck tools.
|
||
Since the introduction of the fuzz testing framework, these tests have been
|
||
used to discover incorrect repair code and missing functionality for entire
|
||
classes of metadata objects in ``xfs_repair``.
|
||
The enhanced testing was used to finalize the deprecation of ``xfs_check`` by
|
||
confirming that ``xfs_repair`` could detect at least as many corruptions as
|
||
the older tool.
|
||
|
||
These tests have been very valuable for ``xfs_scrub`` in the same ways -- they
|
||
allow the online fsck developers to compare online fsck against offline fsck,
|
||
and they enable XFS developers to find deficiencies in the code base.
|
||
|
||
Proposed patchsets include
|
||
`general fuzzer improvements
|
||
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=fuzzer-improvements>`_,
|
||
`fuzzing baselines
|
||
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=fuzz-baseline>`_,
|
||
and `improvements in fuzz testing comprehensiveness
|
||
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=more-fuzz-testing>`_.
|
||
|
||
Stress Testing
|
||
--------------
|
||
|
||
A unique requirement to online fsck is the ability to operate on a filesystem
|
||
concurrently with regular workloads.
|
||
Although it is of course impossible to run ``xfs_scrub`` with *zero* observable
|
||
impact on the running system, the online repair code should never introduce
|
||
inconsistencies into the filesystem metadata, and regular workloads should
|
||
never notice resource starvation.
|
||
To verify that these conditions are being met, fstests has been enhanced in
|
||
the following ways:
|
||
|
||
* For each scrub item type, create a test to exercise checking that item type
|
||
while running ``fsstress``.
|
||
* For each scrub item type, create a test to exercise repairing that item type
|
||
while running ``fsstress``.
|
||
* Race ``fsstress`` and ``xfs_scrub -n`` to ensure that checking the whole
|
||
filesystem doesn't cause problems.
|
||
* Race ``fsstress`` and ``xfs_scrub`` in force-rebuild mode to ensure that
|
||
force-repairing the whole filesystem doesn't cause problems.
|
||
* Race ``xfs_scrub`` in check and force-repair mode against ``fsstress`` while
|
||
freezing and thawing the filesystem.
|
||
* Race ``xfs_scrub`` in check and force-repair mode against ``fsstress`` while
|
||
remounting the filesystem read-only and read-write.
|
||
* The same, but running ``fsx`` instead of ``fsstress``. (Not done yet?)
|
||
|
||
Success is defined by the ability to run all of these tests without observing
|
||
any unexpected filesystem shutdowns due to corrupted metadata, kernel hang
|
||
check warnings, or any other sort of mischief.
|
||
|
||
Proposed patchsets include `general stress testing
|
||
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=race-scrub-and-mount-state-changes>`_
|
||
and the `evolution of existing per-function stress testing
|
||
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=refactor-scrub-stress>`_.
|
||
|
||
4. User Interface
|
||
=================
|
||
|
||
The primary user of online fsck is the system administrator, just like offline
|
||
repair.
|
||
Online fsck presents two modes of operation to administrators:
|
||
A foreground CLI process for online fsck on demand, and a background service
|
||
that performs autonomous checking and repair.
|
||
|
||
Checking on Demand
|
||
------------------
|
||
|
||
For administrators who want the absolute freshest information about the
|
||
metadata in a filesystem, ``xfs_scrub`` can be run as a foreground process on
|
||
a command line.
|
||
The program checks every piece of metadata in the filesystem while the
|
||
administrator waits for the results to be reported, just like the existing
|
||
``xfs_repair`` tool.
|
||
Both tools share a ``-n`` option to perform a read-only scan, and a ``-v``
|
||
option to increase the verbosity of the information reported.
|
||
|
||
A new feature of ``xfs_scrub`` is the ``-x`` option, which employs the error
|
||
correction capabilities of the hardware to check data file contents.
|
||
The media scan is not enabled by default because it may dramatically increase
|
||
program runtime and consume a lot of bandwidth on older storage hardware.
|
||
|
||
The output of a foreground invocation is captured in the system log.
|
||
|
||
The ``xfs_scrub_all`` program walks the list of mounted filesystems and
|
||
initiates ``xfs_scrub`` for each of them in parallel.
|
||
It serializes scans for any filesystems that resolve to the same top level
|
||
kernel block device to prevent resource overconsumption.
|
||
|
||
Background Service
|
||
------------------
|
||
|
||
To reduce the workload of system administrators, the ``xfs_scrub`` package
|
||
provides a suite of `systemd <https://systemd.io/>`_ timers and services that
|
||
run online fsck automatically on weekends by default.
|
||
The background service configures scrub to run with as little privilege as
|
||
possible, the lowest CPU and IO priority, and in a CPU-constrained single
|
||
threaded mode.
|
||
This can be tuned by the systemd administrator at any time to suit the latency
|
||
and throughput requirements of customer workloads.
|
||
|
||
The output of the background service is also captured in the system log.
|
||
If desired, reports of failures (either due to inconsistencies or mere runtime
|
||
errors) can be emailed automatically by setting the ``EMAIL_ADDR`` environment
|
||
variable in the following service files:
|
||
|
||
* ``xfs_scrub_fail@.service``
|
||
* ``xfs_scrub_media_fail@.service``
|
||
* ``xfs_scrub_all_fail.service``
|
||
|
||
The decision to enable the background scan is left to the system administrator.
|
||
This can be done by enabling either of the following services:
|
||
|
||
* ``xfs_scrub_all.timer`` on systemd systems
|
||
* ``xfs_scrub_all.cron`` on non-systemd systems
|
||
|
||
This automatic weekly scan is configured out of the box to perform an
|
||
additional media scan of all file data once per month.
|
||
This is less foolproof than, say, storing file data block checksums, but much
|
||
more performant if application software provides its own integrity checking,
|
||
redundancy can be provided elsewhere above the filesystem, or the storage
|
||
device's integrity guarantees are deemed sufficient.
|
||
|
||
The systemd unit file definitions have been subjected to a security audit
|
||
(as of systemd 249) to ensure that the xfs_scrub processes have as little
|
||
access to the rest of the system as possible.
|
||
This was performed via ``systemd-analyze security``, after which privileges
|
||
were restricted to the minimum required, sandboxing was set up to the maximal
|
||
extent possible with sandboxing and system call filtering; and access to the
|
||
filesystem tree was restricted to the minimum needed to start the program and
|
||
access the filesystem being scanned.
|
||
The service definition files restrict CPU usage to 80% of one CPU core, and
|
||
apply as nice of a priority to IO and CPU scheduling as possible.
|
||
This measure was taken to minimize delays in the rest of the filesystem.
|
||
No such hardening has been performed for the cron job.
|
||
|
||
Proposed patchset:
|
||
`Enabling the xfs_scrub background service
|
||
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-media-scan-service>`_.
|
||
|
||
Health Reporting
|
||
----------------
|
||
|
||
XFS caches a summary of each filesystem's health status in memory.
|
||
The information is updated whenever ``xfs_scrub`` is run, or whenever
|
||
inconsistencies are detected in the filesystem metadata during regular
|
||
operations.
|
||
System administrators should use the ``health`` command of ``xfs_spaceman`` to
|
||
download this information into a human-readable format.
|
||
If problems have been observed, the administrator can schedule a reduced
|
||
service window to run the online repair tool to correct the problem.
|
||
Failing that, the administrator can decide to schedule a maintenance window to
|
||
run the traditional offline repair tool to correct the problem.
|
||
|
||
**Future Work Question**: Should the health reporting integrate with the new
|
||
inotify fs error notification system?
|
||
Would it be helpful for sysadmins to have a daemon to listen for corruption
|
||
notifications and initiate a repair?
|
||
|
||
*Answer*: These questions remain unanswered, but should be a part of the
|
||
conversation with early adopters and potential downstream users of XFS.
|
||
|
||
Proposed patchsets include
|
||
`wiring up health reports to correction returns
|
||
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=corruption-health-reports>`_
|
||
and
|
||
`preservation of sickness info during memory reclaim
|
||
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=indirect-health-reporting>`_.
|
||
|
||
5. Kernel Algorithms and Data Structures
|
||
========================================
|
||
|
||
This section discusses the key algorithms and data structures of the kernel
|
||
code that provide the ability to check and repair metadata while the system
|
||
is running.
|
||
The first chapters in this section reveal the pieces that provide the
|
||
foundation for checking metadata.
|
||
The remainder of this section presents the mechanisms through which XFS
|
||
regenerates itself.
|
||
|
||
Self Describing Metadata
|
||
------------------------
|
||
|
||
Starting with XFS version 5 in 2012, XFS updated the format of nearly every
|
||
ondisk block header to record a magic number, a checksum, a universally
|
||
"unique" identifier (UUID), an owner code, the ondisk address of the block,
|
||
and a log sequence number.
|
||
When loading a block buffer from disk, the magic number, UUID, owner, and
|
||
ondisk address confirm that the retrieved block matches the specific owner of
|
||
the current filesystem, and that the information contained in the block is
|
||
supposed to be found at the ondisk address.
|
||
The first three components enable checking tools to disregard alleged metadata
|
||
that doesn't belong to the filesystem, and the fourth component enables the
|
||
filesystem to detect lost writes.
|
||
|
||
Whenever a file system operation modifies a block, the change is submitted
|
||
to the log as part of a transaction.
|
||
The log then processes these transactions marking them done once they are
|
||
safely persisted to storage.
|
||
The logging code maintains the checksum and the log sequence number of the last
|
||
transactional update.
|
||
Checksums are useful for detecting torn writes and other discrepancies that can
|
||
be introduced between the computer and its storage devices.
|
||
Sequence number tracking enables log recovery to avoid applying out of date
|
||
log updates to the filesystem.
|
||
|
||
These two features improve overall runtime resiliency by providing a means for
|
||
the filesystem to detect obvious corruption when reading metadata blocks from
|
||
disk, but these buffer verifiers cannot provide any consistency checking
|
||
between metadata structures.
|
||
|
||
For more information, please see the documentation for
|
||
Documentation/filesystems/xfs-self-describing-metadata.rst
|
||
|
||
Reverse Mapping
|
||
---------------
|
||
|
||
The original design of XFS (circa 1993) is an improvement upon 1980s Unix
|
||
filesystem design.
|
||
In those days, storage density was expensive, CPU time was scarce, and
|
||
excessive seek time could kill performance.
|
||
For performance reasons, filesystem authors were reluctant to add redundancy to
|
||
the filesystem, even at the cost of data integrity.
|
||
Filesystems designers in the early 21st century choose different strategies to
|
||
increase internal redundancy -- either storing nearly identical copies of
|
||
metadata, or more space-efficient encoding techniques.
|
||
|
||
For XFS, a different redundancy strategy was chosen to modernize the design:
|
||
a secondary space usage index that maps allocated disk extents back to their
|
||
owners.
|
||
By adding a new index, the filesystem retains most of its ability to scale
|
||
well to heavily threaded workloads involving large datasets, since the primary
|
||
file metadata (the directory tree, the file block map, and the allocation
|
||
groups) remain unchanged.
|
||
Like any system that improves redundancy, the reverse-mapping feature increases
|
||
overhead costs for space mapping activities.
|
||
However, it has two critical advantages: first, the reverse index is key to
|
||
enabling online fsck and other requested functionality such as free space
|
||
defragmentation, better media failure reporting, and filesystem shrinking.
|
||
Second, the different ondisk storage format of the reverse mapping btree
|
||
defeats device-level deduplication because the filesystem requires real
|
||
redundancy.
|
||
|
||
+--------------------------------------------------------------------------+
|
||
| **Sidebar**: |
|
||
+--------------------------------------------------------------------------+
|
||
| A criticism of adding the secondary index is that it does nothing to |
|
||
| improve the robustness of user data storage itself. |
|
||
| This is a valid point, but adding a new index for file data block |
|
||
| checksums increases write amplification by turning data overwrites into |
|
||
| copy-writes, which age the filesystem prematurely. |
|
||
| In keeping with thirty years of precedent, users who want file data |
|
||
| integrity can supply as powerful a solution as they require. |
|
||
| As for metadata, the complexity of adding a new secondary index of space |
|
||
| usage is much less than adding volume management and storage device |
|
||
| mirroring to XFS itself. |
|
||
| Perfection of RAID and volume management are best left to existing |
|
||
| layers in the kernel. |
|
||
+--------------------------------------------------------------------------+
|
||
|
||
The information captured in a reverse space mapping record is as follows:
|
||
|
||
.. code-block:: c
|
||
|
||
struct xfs_rmap_irec {
|
||
xfs_agblock_t rm_startblock; /* extent start block */
|
||
xfs_extlen_t rm_blockcount; /* extent length */
|
||
uint64_t rm_owner; /* extent owner */
|
||
uint64_t rm_offset; /* offset within the owner */
|
||
unsigned int rm_flags; /* state flags */
|
||
};
|
||
|
||
The first two fields capture the location and size of the physical space,
|
||
in units of filesystem blocks.
|
||
The owner field tells scrub which metadata structure or file inode have been
|
||
assigned this space.
|
||
For space allocated to files, the offset field tells scrub where the space was
|
||
mapped within the file fork.
|
||
Finally, the flags field provides extra information about the space usage --
|
||
is this an attribute fork extent? A file mapping btree extent? Or an
|
||
unwritten data extent?
|
||
|
||
Online filesystem checking judges the consistency of each primary metadata
|
||
record by comparing its information against all other space indices.
|
||
The reverse mapping index plays a key role in the consistency checking process
|
||
because it contains a centralized alternate copy of all space allocation
|
||
information.
|
||
Program runtime and ease of resource acquisition are the only real limits to
|
||
what online checking can consult.
|
||
For example, a file data extent mapping can be checked against:
|
||
|
||
* The absence of an entry in the free space information.
|
||
* The absence of an entry in the inode index.
|
||
* The absence of an entry in the reference count data if the file is not
|
||
marked as having shared extents.
|
||
* The correspondence of an entry in the reverse mapping information.
|
||
|
||
There are several observations to make about reverse mapping indices:
|
||
|
||
1. Reverse mappings can provide a positive affirmation of correctness if any of
|
||
the above primary metadata are in doubt.
|
||
The checking code for most primary metadata follows a path similar to the
|
||
one outlined above.
|
||
|
||
2. Proving the consistency of secondary metadata with the primary metadata is
|
||
difficult because that requires a full scan of all primary space metadata,
|
||
which is very time intensive.
|
||
For example, checking a reverse mapping record for a file extent mapping
|
||
btree block requires locking the file and searching the entire btree to
|
||
confirm the block.
|
||
Instead, scrub relies on rigorous cross-referencing during the primary space
|
||
mapping structure checks.
|
||
|
||
3. Consistency scans must use non-blocking lock acquisition primitives if the
|
||
required locking order is not the same order used by regular filesystem
|
||
operations.
|
||
For example, if the filesystem normally takes a file ILOCK before taking
|
||
the AGF buffer lock but scrub wants to take a file ILOCK while holding
|
||
an AGF buffer lock, scrub cannot block on that second acquisition.
|
||
This means that forward progress during this part of a scan of the reverse
|
||
mapping data cannot be guaranteed if system load is heavy.
|
||
|
||
In summary, reverse mappings play a key role in reconstruction of primary
|
||
metadata.
|
||
The details of how these records are staged, written to disk, and committed
|
||
into the filesystem are covered in subsequent sections.
|
||
|
||
Checking and Cross-Referencing
|
||
------------------------------
|
||
|
||
The first step of checking a metadata structure is to examine every record
|
||
contained within the structure and its relationship with the rest of the
|
||
system.
|
||
XFS contains multiple layers of checking to try to prevent inconsistent
|
||
metadata from wreaking havoc on the system.
|
||
Each of these layers contributes information that helps the kernel to make
|
||
three decisions about the health of a metadata structure:
|
||
|
||
- Is a part of this structure obviously corrupt (``XFS_SCRUB_OFLAG_CORRUPT``) ?
|
||
- Is this structure inconsistent with the rest of the system
|
||
(``XFS_SCRUB_OFLAG_XCORRUPT``) ?
|
||
- Is there so much damage around the filesystem that cross-referencing is not
|
||
possible (``XFS_SCRUB_OFLAG_XFAIL``) ?
|
||
- Can the structure be optimized to improve performance or reduce the size of
|
||
metadata (``XFS_SCRUB_OFLAG_PREEN``) ?
|
||
- Does the structure contain data that is not inconsistent but deserves review
|
||
by the system administrator (``XFS_SCRUB_OFLAG_WARNING``) ?
|
||
|
||
The following sections describe how the metadata scrubbing process works.
|
||
|
||
Metadata Buffer Verification
|
||
````````````````````````````
|
||
|
||
The lowest layer of metadata protection in XFS are the metadata verifiers built
|
||
into the buffer cache.
|
||
These functions perform inexpensive internal consistency checking of the block
|
||
itself, and answer these questions:
|
||
|
||
- Does the block belong to this filesystem?
|
||
|
||
- Does the block belong to the structure that asked for the read?
|
||
This assumes that metadata blocks only have one owner, which is always true
|
||
in XFS.
|
||
|
||
- Is the type of data stored in the block within a reasonable range of what
|
||
scrub is expecting?
|
||
|
||
- Does the physical location of the block match the location it was read from?
|
||
|
||
- Does the block checksum match the data?
|
||
|
||
The scope of the protections here are very limited -- verifiers can only
|
||
establish that the filesystem code is reasonably free of gross corruption bugs
|
||
and that the storage system is reasonably competent at retrieval.
|
||
Corruption problems observed at runtime cause the generation of health reports,
|
||
failed system calls, and in the extreme case, filesystem shutdowns if the
|
||
corrupt metadata force the cancellation of a dirty transaction.
|
||
|
||
Every online fsck scrubbing function is expected to read every ondisk metadata
|
||
block of a structure in the course of checking the structure.
|
||
Corruption problems observed during a check are immediately reported to
|
||
userspace as corruption; during a cross-reference, they are reported as a
|
||
failure to cross-reference once the full examination is complete.
|
||
Reads satisfied by a buffer already in cache (and hence already verified)
|
||
bypass these checks.
|
||
|
||
Internal Consistency Checks
|
||
```````````````````````````
|
||
|
||
After the buffer cache, the next level of metadata protection is the internal
|
||
record verification code built into the filesystem.
|
||
These checks are split between the buffer verifiers, the in-filesystem users of
|
||
the buffer cache, and the scrub code itself, depending on the amount of higher
|
||
level context required.
|
||
The scope of checking is still internal to the block.
|
||
These higher level checking functions answer these questions:
|
||
|
||
- Does the type of data stored in the block match what scrub is expecting?
|
||
|
||
- Does the block belong to the owning structure that asked for the read?
|
||
|
||
- If the block contains records, do the records fit within the block?
|
||
|
||
- If the block tracks internal free space information, is it consistent with
|
||
the record areas?
|
||
|
||
- Are the records contained inside the block free of obvious corruptions?
|
||
|
||
Record checks in this category are more rigorous and more time-intensive.
|
||
For example, block pointers and inumbers are checked to ensure that they point
|
||
within the dynamically allocated parts of an allocation group and within
|
||
the filesystem.
|
||
Names are checked for invalid characters, and flags are checked for invalid
|
||
combinations.
|
||
Other record attributes are checked for sensible values.
|
||
Btree records spanning an interval of the btree keyspace are checked for
|
||
correct order and lack of mergeability (except for file fork mappings).
|
||
For performance reasons, regular code may skip some of these checks unless
|
||
debugging is enabled or a write is about to occur.
|
||
Scrub functions, of course, must check all possible problems.
|
||
|
||
Validation of Userspace-Controlled Record Attributes
|
||
````````````````````````````````````````````````````
|
||
|
||
Various pieces of filesystem metadata are directly controlled by userspace.
|
||
Because of this nature, validation work cannot be more precise than checking
|
||
that a value is within the possible range.
|
||
These fields include:
|
||
|
||
- Superblock fields controlled by mount options
|
||
- Filesystem labels
|
||
- File timestamps
|
||
- File permissions
|
||
- File size
|
||
- File flags
|
||
- Names present in directory entries, extended attribute keys, and filesystem
|
||
labels
|
||
- Extended attribute key namespaces
|
||
- Extended attribute values
|
||
- File data block contents
|
||
- Quota limits
|
||
- Quota timer expiration (if resource usage exceeds the soft limit)
|
||
|
||
Cross-Referencing Space Metadata
|
||
````````````````````````````````
|
||
|
||
After internal block checks, the next higher level of checking is
|
||
cross-referencing records between metadata structures.
|
||
For regular runtime code, the cost of these checks is considered to be
|
||
prohibitively expensive, but as scrub is dedicated to rooting out
|
||
inconsistencies, it must pursue all avenues of inquiry.
|
||
The exact set of cross-referencing is highly dependent on the context of the
|
||
data structure being checked.
|
||
|
||
The XFS btree code has keyspace scanning functions that online fsck uses to
|
||
cross reference one structure with another.
|
||
Specifically, scrub can scan the key space of an index to determine if that
|
||
keyspace is fully, sparsely, or not at all mapped to records.
|
||
For the reverse mapping btree, it is possible to mask parts of the key for the
|
||
purposes of performing a keyspace scan so that scrub can decide if the rmap
|
||
btree contains records mapping a certain extent of physical space without the
|
||
sparsenses of the rest of the rmap keyspace getting in the way.
|
||
|
||
Btree blocks undergo the following checks before cross-referencing:
|
||
|
||
- Does the type of data stored in the block match what scrub is expecting?
|
||
|
||
- Does the block belong to the owning structure that asked for the read?
|
||
|
||
- Do the records fit within the block?
|
||
|
||
- Are the records contained inside the block free of obvious corruptions?
|
||
|
||
- Are the name hashes in the correct order?
|
||
|
||
- Do node pointers within the btree point to valid block addresses for the type
|
||
of btree?
|
||
|
||
- Do child pointers point towards the leaves?
|
||
|
||
- Do sibling pointers point across the same level?
|
||
|
||
- For each node block record, does the record key accurate reflect the contents
|
||
of the child block?
|
||
|
||
Space allocation records are cross-referenced as follows:
|
||
|
||
1. Any space mentioned by any metadata structure are cross-referenced as
|
||
follows:
|
||
|
||
- Does the reverse mapping index list only the appropriate owner as the
|
||
owner of each block?
|
||
|
||
- Are none of the blocks claimed as free space?
|
||
|
||
- If these aren't file data blocks, are none of the blocks claimed as space
|
||
shared by different owners?
|
||
|
||
2. Btree blocks are cross-referenced as follows:
|
||
|
||
- Everything in class 1 above.
|
||
|
||
- If there's a parent node block, do the keys listed for this block match the
|
||
keyspace of this block?
|
||
|
||
- Do the sibling pointers point to valid blocks? Of the same level?
|
||
|
||
- Do the child pointers point to valid blocks? Of the next level down?
|
||
|
||
3. Free space btree records are cross-referenced as follows:
|
||
|
||
- Everything in class 1 and 2 above.
|
||
|
||
- Does the reverse mapping index list no owners of this space?
|
||
|
||
- Is this space not claimed by the inode index for inodes?
|
||
|
||
- Is it not mentioned by the reference count index?
|
||
|
||
- Is there a matching record in the other free space btree?
|
||
|
||
4. Inode btree records are cross-referenced as follows:
|
||
|
||
- Everything in class 1 and 2 above.
|
||
|
||
- Is there a matching record in free inode btree?
|
||
|
||
- Do cleared bits in the holemask correspond with inode clusters?
|
||
|
||
- Do set bits in the freemask correspond with inode records with zero link
|
||
count?
|
||
|
||
5. Inode records are cross-referenced as follows:
|
||
|
||
- Everything in class 1.
|
||
|
||
- Do all the fields that summarize information about the file forks actually
|
||
match those forks?
|
||
|
||
- Does each inode with zero link count correspond to a record in the free
|
||
inode btree?
|
||
|
||
6. File fork space mapping records are cross-referenced as follows:
|
||
|
||
- Everything in class 1 and 2 above.
|
||
|
||
- Is this space not mentioned by the inode btrees?
|
||
|
||
- If this is a CoW fork mapping, does it correspond to a CoW entry in the
|
||
reference count btree?
|
||
|
||
7. Reference count records are cross-referenced as follows:
|
||
|
||
- Everything in class 1 and 2 above.
|
||
|
||
- Within the space subkeyspace of the rmap btree (that is to say, all
|
||
records mapped to a particular space extent and ignoring the owner info),
|
||
are there the same number of reverse mapping records for each block as the
|
||
reference count record claims?
|
||
|
||
Proposed patchsets are the series to find gaps in
|
||
`refcount btree
|
||
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-detect-refcount-gaps>`_,
|
||
`inode btree
|
||
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-detect-inobt-gaps>`_, and
|
||
`rmap btree
|
||
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-detect-rmapbt-gaps>`_ records;
|
||
to find
|
||
`mergeable records
|
||
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-detect-mergeable-records>`_;
|
||
and to
|
||
`improve cross referencing with rmap
|
||
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-strengthen-rmap-checking>`_
|
||
before starting a repair.
|
||
|
||
Checking Extended Attributes
|
||
````````````````````````````
|
||
|
||
Extended attributes implement a key-value store that enable fragments of data
|
||
to be attached to any file.
|
||
Both the kernel and userspace can access the keys and values, subject to
|
||
namespace and privilege restrictions.
|
||
Most typically these fragments are metadata about the file -- origins, security
|
||
contexts, user-supplied labels, indexing information, etc.
|
||
|
||
Names can be as long as 255 bytes and can exist in several different
|
||
namespaces.
|
||
Values can be as large as 64KB.
|
||
A file's extended attributes are stored in blocks mapped by the attr fork.
|
||
The mappings point to leaf blocks, remote value blocks, or dabtree blocks.
|
||
Block 0 in the attribute fork is always the top of the structure, but otherwise
|
||
each of the three types of blocks can be found at any offset in the attr fork.
|
||
Leaf blocks contain attribute key records that point to the name and the value.
|
||
Names are always stored elsewhere in the same leaf block.
|
||
Values that are less than 3/4 the size of a filesystem block are also stored
|
||
elsewhere in the same leaf block.
|
||
Remote value blocks contain values that are too large to fit inside a leaf.
|
||
If the leaf information exceeds a single filesystem block, a dabtree (also
|
||
rooted at block 0) is created to map hashes of the attribute names to leaf
|
||
blocks in the attr fork.
|
||
|
||
Checking an extended attribute structure is not so straightforward due to the
|
||
lack of separation between attr blocks and index blocks.
|
||
Scrub must read each block mapped by the attr fork and ignore the non-leaf
|
||
blocks:
|
||
|
||
1. Walk the dabtree in the attr fork (if present) to ensure that there are no
|
||
irregularities in the blocks or dabtree mappings that do not point to
|
||
attr leaf blocks.
|
||
|
||
2. Walk the blocks of the attr fork looking for leaf blocks.
|
||
For each entry inside a leaf:
|
||
|
||
a. Validate that the name does not contain invalid characters.
|
||
|
||
b. Read the attr value.
|
||
This performs a named lookup of the attr name to ensure the correctness
|
||
of the dabtree.
|
||
If the value is stored in a remote block, this also validates the
|
||
integrity of the remote value block.
|
||
|
||
Checking and Cross-Referencing Directories
|
||
``````````````````````````````````````````
|
||
|
||
The filesystem directory tree is a directed acylic graph structure, with files
|
||
constituting the nodes, and directory entries (dirents) constituting the edges.
|
||
Directories are a special type of file containing a set of mappings from a
|
||
255-byte sequence (name) to an inumber.
|
||
These are called directory entries, or dirents for short.
|
||
Each directory file must have exactly one directory pointing to the file.
|
||
A root directory points to itself.
|
||
Directory entries point to files of any type.
|
||
Each non-directory file may have multiple directories point to it.
|
||
|
||
In XFS, directories are implemented as a file containing up to three 32GB
|
||
partitions.
|
||
The first partition contains directory entry data blocks.
|
||
Each data block contains variable-sized records associating a user-provided
|
||
name with an inumber and, optionally, a file type.
|
||
If the directory entry data grows beyond one block, the second partition (which
|
||
exists as post-EOF extents) is populated with a block containing free space
|
||
information and an index that maps hashes of the dirent names to directory data
|
||
blocks in the first partition.
|
||
This makes directory name lookups very fast.
|
||
If this second partition grows beyond one block, the third partition is
|
||
populated with a linear array of free space information for faster
|
||
expansions.
|
||
If the free space has been separated and the second partition grows again
|
||
beyond one block, then a dabtree is used to map hashes of dirent names to
|
||
directory data blocks.
|
||
|
||
Checking a directory is pretty straightforward:
|
||
|
||
1. Walk the dabtree in the second partition (if present) to ensure that there
|
||
are no irregularities in the blocks or dabtree mappings that do not point to
|
||
dirent blocks.
|
||
|
||
2. Walk the blocks of the first partition looking for directory entries.
|
||
Each dirent is checked as follows:
|
||
|
||
a. Does the name contain no invalid characters?
|
||
|
||
b. Does the inumber correspond to an actual, allocated inode?
|
||
|
||
c. Does the child inode have a nonzero link count?
|
||
|
||
d. If a file type is included in the dirent, does it match the type of the
|
||
inode?
|
||
|
||
e. If the child is a subdirectory, does the child's dotdot pointer point
|
||
back to the parent?
|
||
|
||
f. If the directory has a second partition, perform a named lookup of the
|
||
dirent name to ensure the correctness of the dabtree.
|
||
|
||
3. Walk the free space list in the third partition (if present) to ensure that
|
||
the free spaces it describes are really unused.
|
||
|
||
Checking operations involving :ref:`parents <dirparent>` and
|
||
:ref:`file link counts <nlinks>` are discussed in more detail in later
|
||
sections.
|
||
|
||
Checking Directory/Attribute Btrees
|
||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||
|
||
As stated in previous sections, the directory/attribute btree (dabtree) index
|
||
maps user-provided names to improve lookup times by avoiding linear scans.
|
||
Internally, it maps a 32-bit hash of the name to a block offset within the
|
||
appropriate file fork.
|
||
|
||
The internal structure of a dabtree closely resembles the btrees that record
|
||
fixed-size metadata records -- each dabtree block contains a magic number, a
|
||
checksum, sibling pointers, a UUID, a tree level, and a log sequence number.
|
||
The format of leaf and node records are the same -- each entry points to the
|
||
next level down in the hierarchy, with dabtree node records pointing to dabtree
|
||
leaf blocks, and dabtree leaf records pointing to non-dabtree blocks elsewhere
|
||
in the fork.
|
||
|
||
Checking and cross-referencing the dabtree is very similar to what is done for
|
||
space btrees:
|
||
|
||
- Does the type of data stored in the block match what scrub is expecting?
|
||
|
||
- Does the block belong to the owning structure that asked for the read?
|
||
|
||
- Do the records fit within the block?
|
||
|
||
- Are the records contained inside the block free of obvious corruptions?
|
||
|
||
- Are the name hashes in the correct order?
|
||
|
||
- Do node pointers within the dabtree point to valid fork offsets for dabtree
|
||
blocks?
|
||
|
||
- Do leaf pointers within the dabtree point to valid fork offsets for directory
|
||
or attr leaf blocks?
|
||
|
||
- Do child pointers point towards the leaves?
|
||
|
||
- Do sibling pointers point across the same level?
|
||
|
||
- For each dabtree node record, does the record key accurate reflect the
|
||
contents of the child dabtree block?
|
||
|
||
- For each dabtree leaf record, does the record key accurate reflect the
|
||
contents of the directory or attr block?
|
||
|
||
Cross-Referencing Summary Counters
|
||
``````````````````````````````````
|
||
|
||
XFS maintains three classes of summary counters: available resources, quota
|
||
resource usage, and file link counts.
|
||
|
||
In theory, the amount of available resources (data blocks, inodes, realtime
|
||
extents) can be found by walking the entire filesystem.
|
||
This would make for very slow reporting, so a transactional filesystem can
|
||
maintain summaries of this information in the superblock.
|
||
Cross-referencing these values against the filesystem metadata should be a
|
||
simple matter of walking the free space and inode metadata in each AG and the
|
||
realtime bitmap, but there are complications that will be discussed in
|
||
:ref:`more detail <fscounters>` later.
|
||
|
||
:ref:`Quota usage <quotacheck>` and :ref:`file link count <nlinks>`
|
||
checking are sufficiently complicated to warrant separate sections.
|
||
|
||
Post-Repair Reverification
|
||
``````````````````````````
|
||
|
||
After performing a repair, the checking code is run a second time to validate
|
||
the new structure, and the results of the health assessment are recorded
|
||
internally and returned to the calling process.
|
||
This step is critical for enabling system administrator to monitor the status
|
||
of the filesystem and the progress of any repairs.
|
||
For developers, it is a useful means to judge the efficacy of error detection
|
||
and correction in the online and offline checking tools.
|
||
|
||
Eventual Consistency vs. Online Fsck
|
||
------------------------------------
|
||
|
||
Complex operations can make modifications to multiple per-AG data structures
|
||
with a chain of transactions.
|
||
These chains, once committed to the log, are restarted during log recovery if
|
||
the system crashes while processing the chain.
|
||
Because the AG header buffers are unlocked between transactions within a chain,
|
||
online checking must coordinate with chained operations that are in progress to
|
||
avoid incorrectly detecting inconsistencies due to pending chains.
|
||
Furthermore, online repair must not run when operations are pending because
|
||
the metadata are temporarily inconsistent with each other, and rebuilding is
|
||
not possible.
|
||
|
||
Only online fsck has this requirement of total consistency of AG metadata, and
|
||
should be relatively rare as compared to filesystem change operations.
|
||
Online fsck coordinates with transaction chains as follows:
|
||
|
||
* For each AG, maintain a count of intent items targeting that AG.
|
||
The count should be bumped whenever a new item is added to the chain.
|
||
The count should be dropped when the filesystem has locked the AG header
|
||
buffers and finished the work.
|
||
|
||
* When online fsck wants to examine an AG, it should lock the AG header
|
||
buffers to quiesce all transaction chains that want to modify that AG.
|
||
If the count is zero, proceed with the checking operation.
|
||
If it is nonzero, cycle the buffer locks to allow the chain to make forward
|
||
progress.
|
||
|
||
This may lead to online fsck taking a long time to complete, but regular
|
||
filesystem updates take precedence over background checking activity.
|
||
Details about the discovery of this situation are presented in the
|
||
:ref:`next section <chain_coordination>`, and details about the solution
|
||
are presented :ref:`after that<intent_drains>`.
|
||
|
||
.. _chain_coordination:
|
||
|
||
Discovery of the Problem
|
||
````````````````````````
|
||
|
||
Midway through the development of online scrubbing, the fsstress tests
|
||
uncovered a misinteraction between online fsck and compound transaction chains
|
||
created by other writer threads that resulted in false reports of metadata
|
||
inconsistency.
|
||
The root cause of these reports is the eventual consistency model introduced by
|
||
the expansion of deferred work items and compound transaction chains when
|
||
reverse mapping and reflink were introduced.
|
||
|
||
Originally, transaction chains were added to XFS to avoid deadlocks when
|
||
unmapping space from files.
|
||
Deadlock avoidance rules require that AGs only be locked in increasing order,
|
||
which makes it impossible (say) to use a single transaction to free a space
|
||
extent in AG 7 and then try to free a now superfluous block mapping btree block
|
||
in AG 3.
|
||
To avoid these kinds of deadlocks, XFS creates Extent Freeing Intent (EFI) log
|
||
items to commit to freeing some space in one transaction while deferring the
|
||
actual metadata updates to a fresh transaction.
|
||
The transaction sequence looks like this:
|
||
|
||
1. The first transaction contains a physical update to the file's block mapping
|
||
structures to remove the mapping from the btree blocks.
|
||
It then attaches to the in-memory transaction an action item to schedule
|
||
deferred freeing of space.
|
||
Concretely, each transaction maintains a list of ``struct
|
||
xfs_defer_pending`` objects, each of which maintains a list of ``struct
|
||
xfs_extent_free_item`` objects.
|
||
Returning to the example above, the action item tracks the freeing of both
|
||
the unmapped space from AG 7 and the block mapping btree (BMBT) block from
|
||
AG 3.
|
||
Deferred frees recorded in this manner are committed in the log by creating
|
||
an EFI log item from the ``struct xfs_extent_free_item`` object and
|
||
attaching the log item to the transaction.
|
||
When the log is persisted to disk, the EFI item is written into the ondisk
|
||
transaction record.
|
||
EFIs can list up to 16 extents to free, all sorted in AG order.
|
||
|
||
2. The second transaction contains a physical update to the free space btrees
|
||
of AG 3 to release the former BMBT block and a second physical update to the
|
||
free space btrees of AG 7 to release the unmapped file space.
|
||
Observe that the physical updates are resequenced in the correct order
|
||
when possible.
|
||
Attached to the transaction is a an extent free done (EFD) log item.
|
||
The EFD contains a pointer to the EFI logged in transaction #1 so that log
|
||
recovery can tell if the EFI needs to be replayed.
|
||
|
||
If the system goes down after transaction #1 is written back to the filesystem
|
||
but before #2 is committed, a scan of the filesystem metadata would show
|
||
inconsistent filesystem metadata because there would not appear to be any owner
|
||
of the unmapped space.
|
||
Happily, log recovery corrects this inconsistency for us -- when recovery finds
|
||
an intent log item but does not find a corresponding intent done item, it will
|
||
reconstruct the incore state of the intent item and finish it.
|
||
In the example above, the log must replay both frees described in the recovered
|
||
EFI to complete the recovery phase.
|
||
|
||
There are subtleties to XFS' transaction chaining strategy to consider:
|
||
|
||
* Log items must be added to a transaction in the correct order to prevent
|
||
conflicts with principal objects that are not held by the transaction.
|
||
In other words, all per-AG metadata updates for an unmapped block must be
|
||
completed before the last update to free the extent, and extents should not
|
||
be reallocated until that last update commits to the log.
|
||
|
||
* AG header buffers are released between each transaction in a chain.
|
||
This means that other threads can observe an AG in an intermediate state,
|
||
but as long as the first subtlety is handled, this should not affect the
|
||
correctness of filesystem operations.
|
||
|
||
* Unmounting the filesystem flushes all pending work to disk, which means that
|
||
offline fsck never sees the temporary inconsistencies caused by deferred
|
||
work item processing.
|
||
|
||
In this manner, XFS employs a form of eventual consistency to avoid deadlocks
|
||
and increase parallelism.
|
||
|
||
During the design phase of the reverse mapping and reflink features, it was
|
||
decided that it was impractical to cram all the reverse mapping updates for a
|
||
single filesystem change into a single transaction because a single file
|
||
mapping operation can explode into many small updates:
|
||
|
||
* The block mapping update itself
|
||
* A reverse mapping update for the block mapping update
|
||
* Fixing the freelist
|
||
* A reverse mapping update for the freelist fix
|
||
|
||
* A shape change to the block mapping btree
|
||
* A reverse mapping update for the btree update
|
||
* Fixing the freelist (again)
|
||
* A reverse mapping update for the freelist fix
|
||
|
||
* An update to the reference counting information
|
||
* A reverse mapping update for the refcount update
|
||
* Fixing the freelist (a third time)
|
||
* A reverse mapping update for the freelist fix
|
||
|
||
* Freeing any space that was unmapped and not owned by any other file
|
||
* Fixing the freelist (a fourth time)
|
||
* A reverse mapping update for the freelist fix
|
||
|
||
* Freeing the space used by the block mapping btree
|
||
* Fixing the freelist (a fifth time)
|
||
* A reverse mapping update for the freelist fix
|
||
|
||
Free list fixups are not usually needed more than once per AG per transaction
|
||
chain, but it is theoretically possible if space is very tight.
|
||
For copy-on-write updates this is even worse, because this must be done once to
|
||
remove the space from a staging area and again to map it into the file!
|
||
|
||
To deal with this explosion in a calm manner, XFS expands its use of deferred
|
||
work items to cover most reverse mapping updates and all refcount updates.
|
||
This reduces the worst case size of transaction reservations by breaking the
|
||
work into a long chain of small updates, which increases the degree of eventual
|
||
consistency in the system.
|
||
Again, this generally isn't a problem because XFS orders its deferred work
|
||
items carefully to avoid resource reuse conflicts between unsuspecting threads.
|
||
|
||
However, online fsck changes the rules -- remember that although physical
|
||
updates to per-AG structures are coordinated by locking the buffers for AG
|
||
headers, buffer locks are dropped between transactions.
|
||
Once scrub acquires resources and takes locks for a data structure, it must do
|
||
all the validation work without releasing the lock.
|
||
If the main lock for a space btree is an AG header buffer lock, scrub may have
|
||
interrupted another thread that is midway through finishing a chain.
|
||
For example, if a thread performing a copy-on-write has completed a reverse
|
||
mapping update but not the corresponding refcount update, the two AG btrees
|
||
will appear inconsistent to scrub and an observation of corruption will be
|
||
recorded. This observation will not be correct.
|
||
If a repair is attempted in this state, the results will be catastrophic!
|
||
|
||
Several other solutions to this problem were evaluated upon discovery of this
|
||
flaw and rejected:
|
||
|
||
1. Add a higher level lock to allocation groups and require writer threads to
|
||
acquire the higher level lock in AG order before making any changes.
|
||
This would be very difficult to implement in practice because it is
|
||
difficult to determine which locks need to be obtained, and in what order,
|
||
without simulating the entire operation.
|
||
Performing a dry run of a file operation to discover necessary locks would
|
||
make the filesystem very slow.
|
||
|
||
2. Make the deferred work coordinator code aware of consecutive intent items
|
||
targeting the same AG and have it hold the AG header buffers locked across
|
||
the transaction roll between updates.
|
||
This would introduce a lot of complexity into the coordinator since it is
|
||
only loosely coupled with the actual deferred work items.
|
||
It would also fail to solve the problem because deferred work items can
|
||
generate new deferred subtasks, but all subtasks must be complete before
|
||
work can start on a new sibling task.
|
||
|
||
3. Teach online fsck to walk all transactions waiting for whichever lock(s)
|
||
protect the data structure being scrubbed to look for pending operations.
|
||
The checking and repair operations must factor these pending operations into
|
||
the evaluations being performed.
|
||
This solution is a nonstarter because it is *extremely* invasive to the main
|
||
filesystem.
|
||
|
||
.. _intent_drains:
|
||
|
||
Intent Drains
|
||
`````````````
|
||
|
||
Online fsck uses an atomic intent item counter and lock cycling to coordinate
|
||
with transaction chains.
|
||
There are two key properties to the drain mechanism.
|
||
First, the counter is incremented when a deferred work item is *queued* to a
|
||
transaction, and it is decremented after the associated intent done log item is
|
||
*committed* to another transaction.
|
||
The second property is that deferred work can be added to a transaction without
|
||
holding an AG header lock, but per-AG work items cannot be marked done without
|
||
locking that AG header buffer to log the physical updates and the intent done
|
||
log item.
|
||
The first property enables scrub to yield to running transaction chains, which
|
||
is an explicit deprioritization of online fsck to benefit file operations.
|
||
The second property of the drain is key to the correct coordination of scrub,
|
||
since scrub will always be able to decide if a conflict is possible.
|
||
|
||
For regular filesystem code, the drain works as follows:
|
||
|
||
1. Call the appropriate subsystem function to add a deferred work item to a
|
||
transaction.
|
||
|
||
2. The function calls ``xfs_defer_drain_bump`` to increase the counter.
|
||
|
||
3. When the deferred item manager wants to finish the deferred work item, it
|
||
calls ``->finish_item`` to complete it.
|
||
|
||
4. The ``->finish_item`` implementation logs some changes and calls
|
||
``xfs_defer_drain_drop`` to decrease the sloppy counter and wake up any threads
|
||
waiting on the drain.
|
||
|
||
5. The subtransaction commits, which unlocks the resource associated with the
|
||
intent item.
|
||
|
||
For scrub, the drain works as follows:
|
||
|
||
1. Lock the resource(s) associated with the metadata being scrubbed.
|
||
For example, a scan of the refcount btree would lock the AGI and AGF header
|
||
buffers.
|
||
|
||
2. If the counter is zero (``xfs_defer_drain_busy`` returns false), there are no
|
||
chains in progress and the operation may proceed.
|
||
|
||
3. Otherwise, release the resources grabbed in step 1.
|
||
|
||
4. Wait for the intent counter to reach zero (``xfs_defer_drain_intents``), then go
|
||
back to step 1 unless a signal has been caught.
|
||
|
||
To avoid polling in step 4, the drain provides a waitqueue for scrub threads to
|
||
be woken up whenever the intent count drops to zero.
|
||
|
||
The proposed patchset is the
|
||
`scrub intent drain series
|
||
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-drain-intents>`_.
|
||
|
||
.. _jump_labels:
|
||
|
||
Static Keys (aka Jump Label Patching)
|
||
`````````````````````````````````````
|
||
|
||
Online fsck for XFS separates the regular filesystem from the checking and
|
||
repair code as much as possible.
|
||
However, there are a few parts of online fsck (such as the intent drains, and
|
||
later, live update hooks) where it is useful for the online fsck code to know
|
||
what's going on in the rest of the filesystem.
|
||
Since it is not expected that online fsck will be constantly running in the
|
||
background, it is very important to minimize the runtime overhead imposed by
|
||
these hooks when online fsck is compiled into the kernel but not actively
|
||
running on behalf of userspace.
|
||
Taking locks in the hot path of a writer thread to access a data structure only
|
||
to find that no further action is necessary is expensive -- on the author's
|
||
computer, this have an overhead of 40-50ns per access.
|
||
Fortunately, the kernel supports dynamic code patching, which enables XFS to
|
||
replace a static branch to hook code with ``nop`` sleds when online fsck isn't
|
||
running.
|
||
This sled has an overhead of however long it takes the instruction decoder to
|
||
skip past the sled, which seems to be on the order of less than 1ns and
|
||
does not access memory outside of instruction fetching.
|
||
|
||
When online fsck enables the static key, the sled is replaced with an
|
||
unconditional branch to call the hook code.
|
||
The switchover is quite expensive (~22000ns) but is paid entirely by the
|
||
program that invoked online fsck, and can be amortized if multiple threads
|
||
enter online fsck at the same time, or if multiple filesystems are being
|
||
checked at the same time.
|
||
Changing the branch direction requires taking the CPU hotplug lock, and since
|
||
CPU initialization requires memory allocation, online fsck must be careful not
|
||
to change a static key while holding any locks or resources that could be
|
||
accessed in the memory reclaim paths.
|
||
To minimize contention on the CPU hotplug lock, care should be taken not to
|
||
enable or disable static keys unnecessarily.
|
||
|
||
Because static keys are intended to minimize hook overhead for regular
|
||
filesystem operations when xfs_scrub is not running, the intended usage
|
||
patterns are as follows:
|
||
|
||
- The hooked part of XFS should declare a static-scoped static key that
|
||
defaults to false.
|
||
The ``DEFINE_STATIC_KEY_FALSE`` macro takes care of this.
|
||
The static key itself should be declared as a ``static`` variable.
|
||
|
||
- When deciding to invoke code that's only used by scrub, the regular
|
||
filesystem should call the ``static_branch_unlikely`` predicate to avoid the
|
||
scrub-only hook code if the static key is not enabled.
|
||
|
||
- The regular filesystem should export helper functions that call
|
||
``static_branch_inc`` to enable and ``static_branch_dec`` to disable the
|
||
static key.
|
||
Wrapper functions make it easy to compile out the relevant code if the kernel
|
||
distributor turns off online fsck at build time.
|
||
|
||
- Scrub functions wanting to turn on scrub-only XFS functionality should call
|
||
the ``xchk_fsgates_enable`` from the setup function to enable a specific
|
||
hook.
|
||
This must be done before obtaining any resources that are used by memory
|
||
reclaim.
|
||
Callers had better be sure they really need the functionality gated by the
|
||
static key; the ``TRY_HARDER`` flag is useful here.
|
||
|
||
Online scrub has resource acquisition helpers (e.g. ``xchk_perag_lock``) to
|
||
handle locking AGI and AGF buffers for all scrubber functions.
|
||
If it detects a conflict between scrub and the running transactions, it will
|
||
try to wait for intents to complete.
|
||
If the caller of the helper has not enabled the static key, the helper will
|
||
return -EDEADLOCK, which should result in the scrub being restarted with the
|
||
``TRY_HARDER`` flag set.
|
||
The scrub setup function should detect that flag, enable the static key, and
|
||
try the scrub again.
|
||
Scrub teardown disables all static keys obtained by ``xchk_fsgates_enable``.
|
||
|
||
For more information, please see the kernel documentation of
|
||
Documentation/staging/static-keys.rst.
|
||
|
||
.. _xfile:
|
||
|
||
Pageable Kernel Memory
|
||
----------------------
|
||
|
||
Some online checking functions work by scanning the filesystem to build a
|
||
shadow copy of an ondisk metadata structure in memory and comparing the two
|
||
copies.
|
||
For online repair to rebuild a metadata structure, it must compute the record
|
||
set that will be stored in the new structure before it can persist that new
|
||
structure to disk.
|
||
Ideally, repairs complete with a single atomic commit that introduces
|
||
a new data structure.
|
||
To meet these goals, the kernel needs to collect a large amount of information
|
||
in a place that doesn't require the correct operation of the filesystem.
|
||
|
||
Kernel memory isn't suitable because:
|
||
|
||
* Allocating a contiguous region of memory to create a C array is very
|
||
difficult, especially on 32-bit systems.
|
||
|
||
* Linked lists of records introduce double pointer overhead which is very high
|
||
and eliminate the possibility of indexed lookups.
|
||
|
||
* Kernel memory is pinned, which can drive the system into OOM conditions.
|
||
|
||
* The system might not have sufficient memory to stage all the information.
|
||
|
||
At any given time, online fsck does not need to keep the entire record set in
|
||
memory, which means that individual records can be paged out if necessary.
|
||
Continued development of online fsck demonstrated that the ability to perform
|
||
indexed data storage would also be very useful.
|
||
Fortunately, the Linux kernel already has a facility for byte-addressable and
|
||
pageable storage: tmpfs.
|
||
In-kernel graphics drivers (most notably i915) take advantage of tmpfs files
|
||
to store intermediate data that doesn't need to be in memory at all times, so
|
||
that usage precedent is already established.
|
||
Hence, the ``xfile`` was born!
|
||
|
||
+--------------------------------------------------------------------------+
|
||
| **Historical Sidebar**: |
|
||
+--------------------------------------------------------------------------+
|
||
| The first edition of online repair inserted records into a new btree as |
|
||
| it found them, which failed because filesystem could shut down with a |
|
||
| built data structure, which would be live after recovery finished. |
|
||
| |
|
||
| The second edition solved the half-rebuilt structure problem by storing |
|
||
| everything in memory, but frequently ran the system out of memory. |
|
||
| |
|
||
| The third edition solved the OOM problem by using linked lists, but the |
|
||
| memory overhead of the list pointers was extreme. |
|
||
+--------------------------------------------------------------------------+
|
||
|
||
xfile Access Models
|
||
```````````````````
|
||
|
||
A survey of the intended uses of xfiles suggested these use cases:
|
||
|
||
1. Arrays of fixed-sized records (space management btrees, directory and
|
||
extended attribute entries)
|
||
|
||
2. Sparse arrays of fixed-sized records (quotas and link counts)
|
||
|
||
3. Large binary objects (BLOBs) of variable sizes (directory and extended
|
||
attribute names and values)
|
||
|
||
4. Staging btrees in memory (reverse mapping btrees)
|
||
|
||
5. Arbitrary contents (realtime space management)
|
||
|
||
To support the first four use cases, high level data structures wrap the xfile
|
||
to share functionality between online fsck functions.
|
||
The rest of this section discusses the interfaces that the xfile presents to
|
||
four of those five higher level data structures.
|
||
The fifth use case is discussed in the :ref:`realtime summary <rtsummary>` case
|
||
study.
|
||
|
||
The most general storage interface supported by the xfile enables the reading
|
||
and writing of arbitrary quantities of data at arbitrary offsets in the xfile.
|
||
This capability is provided by ``xfile_pread`` and ``xfile_pwrite`` functions,
|
||
which behave similarly to their userspace counterparts.
|
||
XFS is very record-based, which suggests that the ability to load and store
|
||
complete records is important.
|
||
To support these cases, a pair of ``xfile_obj_load`` and ``xfile_obj_store``
|
||
functions are provided to read and persist objects into an xfile.
|
||
They are internally the same as pread and pwrite, except that they treat any
|
||
error as an out of memory error.
|
||
For online repair, squashing error conditions in this manner is an acceptable
|
||
behavior because the only reaction is to abort the operation back to userspace.
|
||
All five xfile usecases can be serviced by these four functions.
|
||
|
||
However, no discussion of file access idioms is complete without answering the
|
||
question, "But what about mmap?"
|
||
It is convenient to access storage directly with pointers, just like userspace
|
||
code does with regular memory.
|
||
Online fsck must not drive the system into OOM conditions, which means that
|
||
xfiles must be responsive to memory reclamation.
|
||
tmpfs can only push a pagecache folio to the swap cache if the folio is neither
|
||
pinned nor locked, which means the xfile must not pin too many folios.
|
||
|
||
Short term direct access to xfile contents is done by locking the pagecache
|
||
folio and mapping it into kernel address space.
|
||
Programmatic access (e.g. pread and pwrite) uses this mechanism.
|
||
Folio locks are not supposed to be held for long periods of time, so long
|
||
term direct access to xfile contents is done by bumping the folio refcount,
|
||
mapping it into kernel address space, and dropping the folio lock.
|
||
These long term users *must* be responsive to memory reclaim by hooking into
|
||
the shrinker infrastructure to know when to release folios.
|
||
|
||
The ``xfile_get_page`` and ``xfile_put_page`` functions are provided to
|
||
retrieve the (locked) folio that backs part of an xfile and to release it.
|
||
The only code to use these folio lease functions are the xfarray
|
||
:ref:`sorting<xfarray_sort>` algorithms and the :ref:`in-memory
|
||
btrees<xfbtree>`.
|
||
|
||
xfile Access Coordination
|
||
`````````````````````````
|
||
|
||
For security reasons, xfiles must be owned privately by the kernel.
|
||
They are marked ``S_PRIVATE`` to prevent interference from the security system,
|
||
must never be mapped into process file descriptor tables, and their pages must
|
||
never be mapped into userspace processes.
|
||
|
||
To avoid locking recursion issues with the VFS, all accesses to the shmfs file
|
||
are performed by manipulating the page cache directly.
|
||
xfile writers call the ``->write_begin`` and ``->write_end`` functions of the
|
||
xfile's address space to grab writable pages, copy the caller's buffer into the
|
||
page, and release the pages.
|
||
xfile readers call ``shmem_read_mapping_page_gfp`` to grab pages directly
|
||
before copying the contents into the caller's buffer.
|
||
In other words, xfiles ignore the VFS read and write code paths to avoid
|
||
having to create a dummy ``struct kiocb`` and to avoid taking inode and
|
||
freeze locks.
|
||
tmpfs cannot be frozen, and xfiles must not be exposed to userspace.
|
||
|
||
If an xfile is shared between threads to stage repairs, the caller must provide
|
||
its own locks to coordinate access.
|
||
For example, if a scrub function stores scan results in an xfile and needs
|
||
other threads to provide updates to the scanned data, the scrub function must
|
||
provide a lock for all threads to share.
|
||
|
||
.. _xfarray:
|
||
|
||
Arrays of Fixed-Sized Records
|
||
`````````````````````````````
|
||
|
||
In XFS, each type of indexed space metadata (free space, inodes, reference
|
||
counts, file fork space, and reverse mappings) consists of a set of fixed-size
|
||
records indexed with a classic B+ tree.
|
||
Directories have a set of fixed-size dirent records that point to the names,
|
||
and extended attributes have a set of fixed-size attribute keys that point to
|
||
names and values.
|
||
Quota counters and file link counters index records with numbers.
|
||
During a repair, scrub needs to stage new records during the gathering step and
|
||
retrieve them during the btree building step.
|
||
|
||
Although this requirement can be satisfied by calling the read and write
|
||
methods of the xfile directly, it is simpler for callers for there to be a
|
||
higher level abstraction to take care of computing array offsets, to provide
|
||
iterator functions, and to deal with sparse records and sorting.
|
||
The ``xfarray`` abstraction presents a linear array for fixed-size records atop
|
||
the byte-accessible xfile.
|
||
|
||
.. _xfarray_access_patterns:
|
||
|
||
Array Access Patterns
|
||
^^^^^^^^^^^^^^^^^^^^^
|
||
|
||
Array access patterns in online fsck tend to fall into three categories.
|
||
Iteration of records is assumed to be necessary for all cases and will be
|
||
covered in the next section.
|
||
|
||
The first type of caller handles records that are indexed by position.
|
||
Gaps may exist between records, and a record may be updated multiple times
|
||
during the collection step.
|
||
In other words, these callers want a sparse linearly addressed table file.
|
||
The typical use case are quota records or file link count records.
|
||
Access to array elements is performed programmatically via ``xfarray_load`` and
|
||
``xfarray_store`` functions, which wrap the similarly-named xfile functions to
|
||
provide loading and storing of array elements at arbitrary array indices.
|
||
Gaps are defined to be null records, and null records are defined to be a
|
||
sequence of all zero bytes.
|
||
Null records are detected by calling ``xfarray_element_is_null``.
|
||
They are created either by calling ``xfarray_unset`` to null out an existing
|
||
record or by never storing anything to an array index.
|
||
|
||
The second type of caller handles records that are not indexed by position
|
||
and do not require multiple updates to a record.
|
||
The typical use case here is rebuilding space btrees and key/value btrees.
|
||
These callers can add records to the array without caring about array indices
|
||
via the ``xfarray_append`` function, which stores a record at the end of the
|
||
array.
|
||
For callers that require records to be presentable in a specific order (e.g.
|
||
rebuilding btree data), the ``xfarray_sort`` function can arrange the sorted
|
||
records; this function will be covered later.
|
||
|
||
The third type of caller is a bag, which is useful for counting records.
|
||
The typical use case here is constructing space extent reference counts from
|
||
reverse mapping information.
|
||
Records can be put in the bag in any order, they can be removed from the bag
|
||
at any time, and uniqueness of records is left to callers.
|
||
The ``xfarray_store_anywhere`` function is used to insert a record in any
|
||
null record slot in the bag; and the ``xfarray_unset`` function removes a
|
||
record from the bag.
|
||
|
||
The proposed patchset is the
|
||
`big in-memory array
|
||
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=big-array>`_.
|
||
|
||
Iterating Array Elements
|
||
^^^^^^^^^^^^^^^^^^^^^^^^
|
||
|
||
Most users of the xfarray require the ability to iterate the records stored in
|
||
the array.
|
||
Callers can probe every possible array index with the following:
|
||
|
||
.. code-block:: c
|
||
|
||
xfarray_idx_t i;
|
||
foreach_xfarray_idx(array, i) {
|
||
xfarray_load(array, i, &rec);
|
||
|
||
/* do something with rec */
|
||
}
|
||
|
||
All users of this idiom must be prepared to handle null records or must already
|
||
know that there aren't any.
|
||
|
||
For xfarray users that want to iterate a sparse array, the ``xfarray_iter``
|
||
function ignores indices in the xfarray that have never been written to by
|
||
calling ``xfile_seek_data`` (which internally uses ``SEEK_DATA``) to skip areas
|
||
of the array that are not populated with memory pages.
|
||
Once it finds a page, it will skip the zeroed areas of the page.
|
||
|
||
.. code-block:: c
|
||
|
||
xfarray_idx_t i = XFARRAY_CURSOR_INIT;
|
||
while ((ret = xfarray_iter(array, &i, &rec)) == 1) {
|
||
/* do something with rec */
|
||
}
|
||
|
||
.. _xfarray_sort:
|
||
|
||
Sorting Array Elements
|
||
^^^^^^^^^^^^^^^^^^^^^^
|
||
|
||
During the fourth demonstration of online repair, a community reviewer remarked
|
||
that for performance reasons, online repair ought to load batches of records
|
||
into btree record blocks instead of inserting records into a new btree one at a
|
||
time.
|
||
The btree insertion code in XFS is responsible for maintaining correct ordering
|
||
of the records, so naturally the xfarray must also support sorting the record
|
||
set prior to bulk loading.
|
||
|
||
Case Study: Sorting xfarrays
|
||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||
|
||
The sorting algorithm used in the xfarray is actually a combination of adaptive
|
||
quicksort and a heapsort subalgorithm in the spirit of
|
||
`Sedgewick <https://algs4.cs.princeton.edu/23quicksort/>`_ and
|
||
`pdqsort <https://github.com/orlp/pdqsort>`_, with customizations for the Linux
|
||
kernel.
|
||
To sort records in a reasonably short amount of time, ``xfarray`` takes
|
||
advantage of the binary subpartitioning offered by quicksort, but it also uses
|
||
heapsort to hedge against performance collapse if the chosen quicksort pivots
|
||
are poor.
|
||
Both algorithms are (in general) O(n * lg(n)), but there is a wide performance
|
||
gulf between the two implementations.
|
||
|
||
The Linux kernel already contains a reasonably fast implementation of heapsort.
|
||
It only operates on regular C arrays, which limits the scope of its usefulness.
|
||
There are two key places where the xfarray uses it:
|
||
|
||
* Sorting any record subset backed by a single xfile page.
|
||
|
||
* Loading a small number of xfarray records from potentially disparate parts
|
||
of the xfarray into a memory buffer, and sorting the buffer.
|
||
|
||
In other words, ``xfarray`` uses heapsort to constrain the nested recursion of
|
||
quicksort, thereby mitigating quicksort's worst runtime behavior.
|
||
|
||
Choosing a quicksort pivot is a tricky business.
|
||
A good pivot splits the set to sort in half, leading to the divide and conquer
|
||
behavior that is crucial to O(n * lg(n)) performance.
|
||
A poor pivot barely splits the subset at all, leading to O(n\ :sup:`2`)
|
||
runtime.
|
||
The xfarray sort routine tries to avoid picking a bad pivot by sampling nine
|
||
records into a memory buffer and using the kernel heapsort to identify the
|
||
median of the nine.
|
||
|
||
Most modern quicksort implementations employ Tukey's "ninther" to select a
|
||
pivot from a classic C array.
|
||
Typical ninther implementations pick three unique triads of records, sort each
|
||
of the triads, and then sort the middle value of each triad to determine the
|
||
ninther value.
|
||
As stated previously, however, xfile accesses are not entirely cheap.
|
||
It turned out to be much more performant to read the nine elements into a
|
||
memory buffer, run the kernel's in-memory heapsort on the buffer, and choose
|
||
the 4th element of that buffer as the pivot.
|
||
Tukey's ninthers are described in J. W. Tukey, `The ninther, a technique for
|
||
low-effort robust (resistant) location in large samples`, in *Contributions to
|
||
Survey Sampling and Applied Statistics*, edited by H. David, (Academic Press,
|
||
1978), pp. 251–257.
|
||
|
||
The partitioning of quicksort is fairly textbook -- rearrange the record
|
||
subset around the pivot, then set up the current and next stack frames to
|
||
sort with the larger and the smaller halves of the pivot, respectively.
|
||
This keeps the stack space requirements to log2(record count).
|
||
|
||
As a final performance optimization, the hi and lo scanning phase of quicksort
|
||
keeps examined xfile pages mapped in the kernel for as long as possible to
|
||
reduce map/unmap cycles.
|
||
Surprisingly, this reduces overall sort runtime by nearly half again after
|
||
accounting for the application of heapsort directly onto xfile pages.
|
||
|
||
.. _xfblob:
|
||
|
||
Blob Storage
|
||
````````````
|
||
|
||
Extended attributes and directories add an additional requirement for staging
|
||
records: arbitrary byte sequences of finite length.
|
||
Each directory entry record needs to store entry name,
|
||
and each extended attribute needs to store both the attribute name and value.
|
||
The names, keys, and values can consume a large amount of memory, so the
|
||
``xfblob`` abstraction was created to simplify management of these blobs
|
||
atop an xfile.
|
||
|
||
Blob arrays provide ``xfblob_load`` and ``xfblob_store`` functions to retrieve
|
||
and persist objects.
|
||
The store function returns a magic cookie for every object that it persists.
|
||
Later, callers provide this cookie to the ``xblob_load`` to recall the object.
|
||
The ``xfblob_free`` function frees a specific blob, and the ``xfblob_truncate``
|
||
function frees them all because compaction is not needed.
|
||
|
||
The details of repairing directories and extended attributes will be discussed
|
||
in a subsequent section about atomic extent swapping.
|
||
However, it should be noted that these repair functions only use blob storage
|
||
to cache a small number of entries before adding them to a temporary ondisk
|
||
file, which is why compaction is not required.
|
||
|
||
The proposed patchset is at the start of the
|
||
`extended attribute repair
|
||
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-xattrs>`_ series.
|
||
|
||
.. _xfbtree:
|
||
|
||
In-Memory B+Trees
|
||
`````````````````
|
||
|
||
The chapter about :ref:`secondary metadata<secondary_metadata>` mentioned that
|
||
checking and repairing of secondary metadata commonly requires coordination
|
||
between a live metadata scan of the filesystem and writer threads that are
|
||
updating that metadata.
|
||
Keeping the scan data up to date requires requires the ability to propagate
|
||
metadata updates from the filesystem into the data being collected by the scan.
|
||
This *can* be done by appending concurrent updates into a separate log file and
|
||
applying them before writing the new metadata to disk, but this leads to
|
||
unbounded memory consumption if the rest of the system is very busy.
|
||
Another option is to skip the side-log and commit live updates from the
|
||
filesystem directly into the scan data, which trades more overhead for a lower
|
||
maximum memory requirement.
|
||
In both cases, the data structure holding the scan results must support indexed
|
||
access to perform well.
|
||
|
||
Given that indexed lookups of scan data is required for both strategies, online
|
||
fsck employs the second strategy of committing live updates directly into
|
||
scan data.
|
||
Because xfarrays are not indexed and do not enforce record ordering, they
|
||
are not suitable for this task.
|
||
Conveniently, however, XFS has a library to create and maintain ordered reverse
|
||
mapping records: the existing rmap btree code!
|
||
If only there was a means to create one in memory.
|
||
|
||
Recall that the :ref:`xfile <xfile>` abstraction represents memory pages as a
|
||
regular file, which means that the kernel can create byte or block addressable
|
||
virtual address spaces at will.
|
||
The XFS buffer cache specializes in abstracting IO to block-oriented address
|
||
spaces, which means that adaptation of the buffer cache to interface with
|
||
xfiles enables reuse of the entire btree library.
|
||
Btrees built atop an xfile are collectively known as ``xfbtrees``.
|
||
The next few sections describe how they actually work.
|
||
|
||
The proposed patchset is the
|
||
`in-memory btree
|
||
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=in-memory-btrees>`_
|
||
series.
|
||
|
||
Using xfiles as a Buffer Cache Target
|
||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||
|
||
Two modifications are necessary to support xfiles as a buffer cache target.
|
||
The first is to make it possible for the ``struct xfs_buftarg`` structure to
|
||
host the ``struct xfs_buf`` rhashtable, because normally those are held by a
|
||
per-AG structure.
|
||
The second change is to modify the buffer ``ioapply`` function to "read" cached
|
||
pages from the xfile and "write" cached pages back to the xfile.
|
||
Multiple access to individual buffers is controlled by the ``xfs_buf`` lock,
|
||
since the xfile does not provide any locking on its own.
|
||
With this adaptation in place, users of the xfile-backed buffer cache use
|
||
exactly the same APIs as users of the disk-backed buffer cache.
|
||
The separation between xfile and buffer cache implies higher memory usage since
|
||
they do not share pages, but this property could some day enable transactional
|
||
updates to an in-memory btree.
|
||
Today, however, it simply eliminates the need for new code.
|
||
|
||
Space Management with an xfbtree
|
||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||
|
||
Space management for an xfile is very simple -- each btree block is one memory
|
||
page in size.
|
||
These blocks use the same header format as an on-disk btree, but the in-memory
|
||
block verifiers ignore the checksums, assuming that xfile memory is no more
|
||
corruption-prone than regular DRAM.
|
||
Reusing existing code here is more important than absolute memory efficiency.
|
||
|
||
The very first block of an xfile backing an xfbtree contains a header block.
|
||
The header describes the owner, height, and the block number of the root
|
||
xfbtree block.
|
||
|
||
To allocate a btree block, use ``xfile_seek_data`` to find a gap in the file.
|
||
If there are no gaps, create one by extending the length of the xfile.
|
||
Preallocate space for the block with ``xfile_prealloc``, and hand back the
|
||
location.
|
||
To free an xfbtree block, use ``xfile_discard`` (which internally uses
|
||
``FALLOC_FL_PUNCH_HOLE``) to remove the memory page from the xfile.
|
||
|
||
Populating an xfbtree
|
||
^^^^^^^^^^^^^^^^^^^^^
|
||
|
||
An online fsck function that wants to create an xfbtree should proceed as
|
||
follows:
|
||
|
||
1. Call ``xfile_create`` to create an xfile.
|
||
|
||
2. Call ``xfs_alloc_memory_buftarg`` to create a buffer cache target structure
|
||
pointing to the xfile.
|
||
|
||
3. Pass the buffer cache target, buffer ops, and other information to
|
||
``xfbtree_create`` to write an initial tree header and root block to the
|
||
xfile.
|
||
Each btree type should define a wrapper that passes necessary arguments to
|
||
the creation function.
|
||
For example, rmap btrees define ``xfs_rmapbt_mem_create`` to take care of
|
||
all the necessary details for callers.
|
||
A ``struct xfbtree`` object will be returned.
|
||
|
||
4. Pass the xfbtree object to the btree cursor creation function for the
|
||
btree type.
|
||
Following the example above, ``xfs_rmapbt_mem_cursor`` takes care of this
|
||
for callers.
|
||
|
||
5. Pass the btree cursor to the regular btree functions to make queries against
|
||
and to update the in-memory btree.
|
||
For example, a btree cursor for an rmap xfbtree can be passed to the
|
||
``xfs_rmap_*`` functions just like any other btree cursor.
|
||
See the :ref:`next section<xfbtree_commit>` for information on dealing with
|
||
xfbtree updates that are logged to a transaction.
|
||
|
||
6. When finished, delete the btree cursor, destroy the xfbtree object, free the
|
||
buffer target, and the destroy the xfile to release all resources.
|
||
|
||
.. _xfbtree_commit:
|
||
|
||
Committing Logged xfbtree Buffers
|
||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||
|
||
Although it is a clever hack to reuse the rmap btree code to handle the staging
|
||
structure, the ephemeral nature of the in-memory btree block storage presents
|
||
some challenges of its own.
|
||
The XFS transaction manager must not commit buffer log items for buffers backed
|
||
by an xfile because the log format does not understand updates for devices
|
||
other than the data device.
|
||
An ephemeral xfbtree probably will not exist by the time the AIL checkpoints
|
||
log transactions back into the filesystem, and certainly won't exist during
|
||
log recovery.
|
||
For these reasons, any code updating an xfbtree in transaction context must
|
||
remove the buffer log items from the transaction and write the updates into the
|
||
backing xfile before committing or cancelling the transaction.
|
||
|
||
The ``xfbtree_trans_commit`` and ``xfbtree_trans_cancel`` functions implement
|
||
this functionality as follows:
|
||
|
||
1. Find each buffer log item whose buffer targets the xfile.
|
||
|
||
2. Record the dirty/ordered status of the log item.
|
||
|
||
3. Detach the log item from the buffer.
|
||
|
||
4. Queue the buffer to a special delwri list.
|
||
|
||
5. Clear the transaction dirty flag if the only dirty log items were the ones
|
||
that were detached in step 3.
|
||
|
||
6. Submit the delwri list to commit the changes to the xfile, if the updates
|
||
are being committed.
|
||
|
||
After removing xfile logged buffers from the transaction in this manner, the
|
||
transaction can be committed or cancelled.
|
||
|
||
Bulk Loading of Ondisk B+Trees
|
||
------------------------------
|
||
|
||
As mentioned previously, early iterations of online repair built new btree
|
||
structures by creating a new btree and adding observations individually.
|
||
Loading a btree one record at a time had a slight advantage of not requiring
|
||
the incore records to be sorted prior to commit, but was very slow and leaked
|
||
blocks if the system went down during a repair.
|
||
Loading records one at a time also meant that repair could not control the
|
||
loading factor of the blocks in the new btree.
|
||
|
||
Fortunately, the venerable ``xfs_repair`` tool had a more efficient means for
|
||
rebuilding a btree index from a collection of records -- bulk btree loading.
|
||
This was implemented rather inefficiently code-wise, since ``xfs_repair``
|
||
had separate copy-pasted implementations for each btree type.
|
||
|
||
To prepare for online fsck, each of the four bulk loaders were studied, notes
|
||
were taken, and the four were refactored into a single generic btree bulk
|
||
loading mechanism.
|
||
Those notes in turn have been refreshed and are presented below.
|
||
|
||
Geometry Computation
|
||
````````````````````
|
||
|
||
The zeroth step of bulk loading is to assemble the entire record set that will
|
||
be stored in the new btree, and sort the records.
|
||
Next, call ``xfs_btree_bload_compute_geometry`` to compute the shape of the
|
||
btree from the record set, the type of btree, and any load factor preferences.
|
||
This information is required for resource reservation.
|
||
|
||
First, the geometry computation computes the minimum and maximum records that
|
||
will fit in a leaf block from the size of a btree block and the size of the
|
||
block header.
|
||
Roughly speaking, the maximum number of records is::
|
||
|
||
maxrecs = (block_size - header_size) / record_size
|
||
|
||
The XFS design specifies that btree blocks should be merged when possible,
|
||
which means the minimum number of records is half of maxrecs::
|
||
|
||
minrecs = maxrecs / 2
|
||
|
||
The next variable to determine is the desired loading factor.
|
||
This must be at least minrecs and no more than maxrecs.
|
||
Choosing minrecs is undesirable because it wastes half the block.
|
||
Choosing maxrecs is also undesirable because adding a single record to each
|
||
newly rebuilt leaf block will cause a tree split, which causes a noticeable
|
||
drop in performance immediately afterwards.
|
||
The default loading factor was chosen to be 75% of maxrecs, which provides a
|
||
reasonably compact structure without any immediate split penalties::
|
||
|
||
default_load_factor = (maxrecs + minrecs) / 2
|
||
|
||
If space is tight, the loading factor will be set to maxrecs to try to avoid
|
||
running out of space::
|
||
|
||
leaf_load_factor = enough space ? default_load_factor : maxrecs
|
||
|
||
Load factor is computed for btree node blocks using the combined size of the
|
||
btree key and pointer as the record size::
|
||
|
||
maxrecs = (block_size - header_size) / (key_size + ptr_size)
|
||
minrecs = maxrecs / 2
|
||
node_load_factor = enough space ? default_load_factor : maxrecs
|
||
|
||
Once that's done, the number of leaf blocks required to store the record set
|
||
can be computed as::
|
||
|
||
leaf_blocks = ceil(record_count / leaf_load_factor)
|
||
|
||
The number of node blocks needed to point to the next level down in the tree
|
||
is computed as::
|
||
|
||
n_blocks = (n == 0 ? leaf_blocks : node_blocks[n])
|
||
node_blocks[n + 1] = ceil(n_blocks / node_load_factor)
|
||
|
||
The entire computation is performed recursively until the current level only
|
||
needs one block.
|
||
The resulting geometry is as follows:
|
||
|
||
- For AG-rooted btrees, this level is the root level, so the height of the new
|
||
tree is ``level + 1`` and the space needed is the summation of the number of
|
||
blocks on each level.
|
||
|
||
- For inode-rooted btrees where the records in the top level do not fit in the
|
||
inode fork area, the height is ``level + 2``, the space needed is the
|
||
summation of the number of blocks on each level, and the inode fork points to
|
||
the root block.
|
||
|
||
- For inode-rooted btrees where the records in the top level can be stored in
|
||
the inode fork area, then the root block can be stored in the inode, the
|
||
height is ``level + 1``, and the space needed is one less than the summation
|
||
of the number of blocks on each level.
|
||
This only becomes relevant when non-bmap btrees gain the ability to root in
|
||
an inode, which is a future patchset and only included here for completeness.
|
||
|
||
.. _newbt:
|
||
|
||
Reserving New B+Tree Blocks
|
||
```````````````````````````
|
||
|
||
Once repair knows the number of blocks needed for the new btree, it allocates
|
||
those blocks using the free space information.
|
||
Each reserved extent is tracked separately by the btree builder state data.
|
||
To improve crash resilience, the reservation code also logs an Extent Freeing
|
||
Intent (EFI) item in the same transaction as each space allocation and attaches
|
||
its in-memory ``struct xfs_extent_free_item`` object to the space reservation.
|
||
If the system goes down, log recovery will use the unfinished EFIs to free the
|
||
unused space, the free space, leaving the filesystem unchanged.
|
||
|
||
Each time the btree builder claims a block for the btree from a reserved
|
||
extent, it updates the in-memory reservation to reflect the claimed space.
|
||
Block reservation tries to allocate as much contiguous space as possible to
|
||
reduce the number of EFIs in play.
|
||
|
||
While repair is writing these new btree blocks, the EFIs created for the space
|
||
reservations pin the tail of the ondisk log.
|
||
It's possible that other parts of the system will remain busy and push the head
|
||
of the log towards the pinned tail.
|
||
To avoid livelocking the filesystem, the EFIs must not pin the tail of the log
|
||
for too long.
|
||
To alleviate this problem, the dynamic relogging capability of the deferred ops
|
||
mechanism is reused here to commit a transaction at the log head containing an
|
||
EFD for the old EFI and new EFI at the head.
|
||
This enables the log to release the old EFI to keep the log moving forwards.
|
||
|
||
EFIs have a role to play during the commit and reaping phases; please see the
|
||
next section and the section about :ref:`reaping<reaping>` for more details.
|
||
|
||
Proposed patchsets are the
|
||
`bitmap rework
|
||
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-bitmap-rework>`_
|
||
and the
|
||
`preparation for bulk loading btrees
|
||
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-prep-for-bulk-loading>`_.
|
||
|
||
|
||
Writing the New Tree
|
||
````````````````````
|
||
|
||
This part is pretty simple -- the btree builder (``xfs_btree_bulkload``) claims
|
||
a block from the reserved list, writes the new btree block header, fills the
|
||
rest of the block with records, and adds the new leaf block to a list of
|
||
written blocks::
|
||
|
||
┌────┐
|
||
│leaf│
|
||
│RRR │
|
||
└────┘
|
||
|
||
Sibling pointers are set every time a new block is added to the level::
|
||
|
||
┌────┐ ┌────┐ ┌────┐ ┌────┐
|
||
│leaf│→│leaf│→│leaf│→│leaf│
|
||
│RRR │←│RRR │←│RRR │←│RRR │
|
||
└────┘ └────┘ └────┘ └────┘
|
||
|
||
When it finishes writing the record leaf blocks, it moves on to the node
|
||
blocks
|
||
To fill a node block, it walks each block in the next level down in the tree
|
||
to compute the relevant keys and write them into the parent node::
|
||
|
||
┌────┐ ┌────┐
|
||
│node│──────→│node│
|
||
│PP │←──────│PP │
|
||
└────┘ └────┘
|
||
↙ ↘ ↙ ↘
|
||
┌────┐ ┌────┐ ┌────┐ ┌────┐
|
||
│leaf│→│leaf│→│leaf│→│leaf│
|
||
│RRR │←│RRR │←│RRR │←│RRR │
|
||
└────┘ └────┘ └────┘ └────┘
|
||
|
||
When it reaches the root level, it is ready to commit the new btree!::
|
||
|
||
┌─────────┐
|
||
│ root │
|
||
│ PP │
|
||
└─────────┘
|
||
↙ ↘
|
||
┌────┐ ┌────┐
|
||
│node│──────→│node│
|
||
│PP │←──────│PP │
|
||
└────┘ └────┘
|
||
↙ ↘ ↙ ↘
|
||
┌────┐ ┌────┐ ┌────┐ ┌────┐
|
||
│leaf│→│leaf│→│leaf│→│leaf│
|
||
│RRR │←│RRR │←│RRR │←│RRR │
|
||
└────┘ └────┘ └────┘ └────┘
|
||
|
||
The first step to commit the new btree is to persist the btree blocks to disk
|
||
synchronously.
|
||
This is a little complicated because a new btree block could have been freed
|
||
in the recent past, so the builder must use ``xfs_buf_delwri_queue_here`` to
|
||
remove the (stale) buffer from the AIL list before it can write the new blocks
|
||
to disk.
|
||
Blocks are queued for IO using a delwri list and written in one large batch
|
||
with ``xfs_buf_delwri_submit``.
|
||
|
||
Once the new blocks have been persisted to disk, control returns to the
|
||
individual repair function that called the bulk loader.
|
||
The repair function must log the location of the new root in a transaction,
|
||
clean up the space reservations that were made for the new btree, and reap the
|
||
old metadata blocks:
|
||
|
||
1. Commit the location of the new btree root.
|
||
|
||
2. For each incore reservation:
|
||
|
||
a. Log Extent Freeing Done (EFD) items for all the space that was consumed
|
||
by the btree builder. The new EFDs must point to the EFIs attached to
|
||
the reservation to prevent log recovery from freeing the new blocks.
|
||
|
||
b. For unclaimed portions of incore reservations, create a regular deferred
|
||
extent free work item to be free the unused space later in the
|
||
transaction chain.
|
||
|
||
c. The EFDs and EFIs logged in steps 2a and 2b must not overrun the
|
||
reservation of the committing transaction.
|
||
If the btree loading code suspects this might be about to happen, it must
|
||
call ``xrep_defer_finish`` to clear out the deferred work and obtain a
|
||
fresh transaction.
|
||
|
||
3. Clear out the deferred work a second time to finish the commit and clean
|
||
the repair transaction.
|
||
|
||
The transaction rolling in steps 2c and 3 represent a weakness in the repair
|
||
algorithm, because a log flush and a crash before the end of the reap step can
|
||
result in space leaking.
|
||
Online repair functions minimize the chances of this occurring by using very
|
||
large transactions, which each can accommodate many thousands of block freeing
|
||
instructions.
|
||
Repair moves on to reaping the old blocks, which will be presented in a
|
||
subsequent :ref:`section<reaping>` after a few case studies of bulk loading.
|
||
|
||
Case Study: Rebuilding the Inode Index
|
||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||
|
||
The high level process to rebuild the inode index btree is:
|
||
|
||
1. Walk the reverse mapping records to generate ``struct xfs_inobt_rec``
|
||
records from the inode chunk information and a bitmap of the old inode btree
|
||
blocks.
|
||
|
||
2. Append the records to an xfarray in inode order.
|
||
|
||
3. Use the ``xfs_btree_bload_compute_geometry`` function to compute the number
|
||
of blocks needed for the inode btree.
|
||
If the free space inode btree is enabled, call it again to estimate the
|
||
geometry of the finobt.
|
||
|
||
4. Allocate the number of blocks computed in the previous step.
|
||
|
||
5. Use ``xfs_btree_bload`` to write the xfarray records to btree blocks and
|
||
generate the internal node blocks.
|
||
If the free space inode btree is enabled, call it again to load the finobt.
|
||
|
||
6. Commit the location of the new btree root block(s) to the AGI.
|
||
|
||
7. Reap the old btree blocks using the bitmap created in step 1.
|
||
|
||
Details are as follows.
|
||
|
||
The inode btree maps inumbers to the ondisk location of the associated
|
||
inode records, which means that the inode btrees can be rebuilt from the
|
||
reverse mapping information.
|
||
Reverse mapping records with an owner of ``XFS_RMAP_OWN_INOBT`` marks the
|
||
location of the old inode btree blocks.
|
||
Each reverse mapping record with an owner of ``XFS_RMAP_OWN_INODES`` marks the
|
||
location of at least one inode cluster buffer.
|
||
A cluster is the smallest number of ondisk inodes that can be allocated or
|
||
freed in a single transaction; it is never smaller than 1 fs block or 4 inodes.
|
||
|
||
For the space represented by each inode cluster, ensure that there are no
|
||
records in the free space btrees nor any records in the reference count btree.
|
||
If there are, the space metadata inconsistencies are reason enough to abort the
|
||
operation.
|
||
Otherwise, read each cluster buffer to check that its contents appear to be
|
||
ondisk inodes and to decide if the file is allocated
|
||
(``xfs_dinode.i_mode != 0``) or free (``xfs_dinode.i_mode == 0``).
|
||
Accumulate the results of successive inode cluster buffer reads until there is
|
||
enough information to fill a single inode chunk record, which is 64 consecutive
|
||
numbers in the inumber keyspace.
|
||
If the chunk is sparse, the chunk record may include holes.
|
||
|
||
Once the repair function accumulates one chunk's worth of data, it calls
|
||
``xfarray_append`` to add the inode btree record to the xfarray.
|
||
This xfarray is walked twice during the btree creation step -- once to populate
|
||
the inode btree with all inode chunk records, and a second time to populate the
|
||
free inode btree with records for chunks that have free non-sparse inodes.
|
||
The number of records for the inode btree is the number of xfarray records,
|
||
but the record count for the free inode btree has to be computed as inode chunk
|
||
records are stored in the xfarray.
|
||
|
||
The proposed patchset is the
|
||
`AG btree repair
|
||
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-ag-btrees>`_
|
||
series.
|
||
|
||
Case Study: Rebuilding the Space Reference Counts
|
||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||
|
||
Reverse mapping records are used to rebuild the reference count information.
|
||
Reference counts are required for correct operation of copy on write for shared
|
||
file data.
|
||
Imagine the reverse mapping entries as rectangles representing extents of
|
||
physical blocks, and that the rectangles can be laid down to allow them to
|
||
overlap each other.
|
||
From the diagram below, it is apparent that a reference count record must start
|
||
or end wherever the height of the stack changes.
|
||
In other words, the record emission stimulus is level-triggered::
|
||
|
||
█ ███
|
||
██ █████ ████ ███ ██████
|
||
██ ████ ███████████ ████ █████████
|
||
████████████████████████████████ ███████████
|
||
^ ^ ^^ ^^ ^ ^^ ^^^ ^^^^ ^ ^^ ^ ^ ^
|
||
2 1 23 21 3 43 234 2123 1 01 2 3 0
|
||
|
||
The ondisk reference count btree does not store the refcount == 0 cases because
|
||
the free space btree already records which blocks are free.
|
||
Extents being used to stage copy-on-write operations should be the only records
|
||
with refcount == 1.
|
||
Single-owner file blocks aren't recorded in either the free space or the
|
||
reference count btrees.
|
||
|
||
The high level process to rebuild the reference count btree is:
|
||
|
||
1. Walk the reverse mapping records to generate ``struct xfs_refcount_irec``
|
||
records for any space having more than one reverse mapping and add them to
|
||
the xfarray.
|
||
Any records owned by ``XFS_RMAP_OWN_COW`` are also added to the xfarray
|
||
because these are extents allocated to stage a copy on write operation and
|
||
are tracked in the refcount btree.
|
||
|
||
Use any records owned by ``XFS_RMAP_OWN_REFC`` to create a bitmap of old
|
||
refcount btree blocks.
|
||
|
||
2. Sort the records in physical extent order, putting the CoW staging extents
|
||
at the end of the xfarray.
|
||
This matches the sorting order of records in the refcount btree.
|
||
|
||
3. Use the ``xfs_btree_bload_compute_geometry`` function to compute the number
|
||
of blocks needed for the new tree.
|
||
|
||
4. Allocate the number of blocks computed in the previous step.
|
||
|
||
5. Use ``xfs_btree_bload`` to write the xfarray records to btree blocks and
|
||
generate the internal node blocks.
|
||
|
||
6. Commit the location of new btree root block to the AGF.
|
||
|
||
7. Reap the old btree blocks using the bitmap created in step 1.
|
||
|
||
Details are as follows; the same algorithm is used by ``xfs_repair`` to
|
||
generate refcount information from reverse mapping records.
|
||
|
||
- Until the reverse mapping btree runs out of records:
|
||
|
||
- Retrieve the next record from the btree and put it in a bag.
|
||
|
||
- Collect all records with the same starting block from the btree and put
|
||
them in the bag.
|
||
|
||
- While the bag isn't empty:
|
||
|
||
- Among the mappings in the bag, compute the lowest block number where the
|
||
reference count changes.
|
||
This position will be either the starting block number of the next
|
||
unprocessed reverse mapping or the next block after the shortest mapping
|
||
in the bag.
|
||
|
||
- Remove all mappings from the bag that end at this position.
|
||
|
||
- Collect all reverse mappings that start at this position from the btree
|
||
and put them in the bag.
|
||
|
||
- If the size of the bag changed and is greater than one, create a new
|
||
refcount record associating the block number range that we just walked to
|
||
the size of the bag.
|
||
|
||
The bag-like structure in this case is a type 2 xfarray as discussed in the
|
||
:ref:`xfarray access patterns<xfarray_access_patterns>` section.
|
||
Reverse mappings are added to the bag using ``xfarray_store_anywhere`` and
|
||
removed via ``xfarray_unset``.
|
||
Bag members are examined through ``xfarray_iter`` loops.
|
||
|
||
The proposed patchset is the
|
||
`AG btree repair
|
||
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-ag-btrees>`_
|
||
series.
|
||
|
||
Case Study: Rebuilding File Fork Mapping Indices
|
||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||
|
||
The high level process to rebuild a data/attr fork mapping btree is:
|
||
|
||
1. Walk the reverse mapping records to generate ``struct xfs_bmbt_rec``
|
||
records from the reverse mapping records for that inode and fork.
|
||
Append these records to an xfarray.
|
||
Compute the bitmap of the old bmap btree blocks from the ``BMBT_BLOCK``
|
||
records.
|
||
|
||
2. Use the ``xfs_btree_bload_compute_geometry`` function to compute the number
|
||
of blocks needed for the new tree.
|
||
|
||
3. Sort the records in file offset order.
|
||
|
||
4. If the extent records would fit in the inode fork immediate area, commit the
|
||
records to that immediate area and skip to step 8.
|
||
|
||
5. Allocate the number of blocks computed in the previous step.
|
||
|
||
6. Use ``xfs_btree_bload`` to write the xfarray records to btree blocks and
|
||
generate the internal node blocks.
|
||
|
||
7. Commit the new btree root block to the inode fork immediate area.
|
||
|
||
8. Reap the old btree blocks using the bitmap created in step 1.
|
||
|
||
There are some complications here:
|
||
First, it's possible to move the fork offset to adjust the sizes of the
|
||
immediate areas if the data and attr forks are not both in BMBT format.
|
||
Second, if there are sufficiently few fork mappings, it may be possible to use
|
||
EXTENTS format instead of BMBT, which may require a conversion.
|
||
Third, the incore extent map must be reloaded carefully to avoid disturbing
|
||
any delayed allocation extents.
|
||
|
||
The proposed patchset is the
|
||
`file mapping repair
|
||
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-file-mappings>`_
|
||
series.
|
||
|
||
.. _reaping:
|
||
|
||
Reaping Old Metadata Blocks
|
||
---------------------------
|
||
|
||
Whenever online fsck builds a new data structure to replace one that is
|
||
suspect, there is a question of how to find and dispose of the blocks that
|
||
belonged to the old structure.
|
||
The laziest method of course is not to deal with them at all, but this slowly
|
||
leads to service degradations as space leaks out of the filesystem.
|
||
Hopefully, someone will schedule a rebuild of the free space information to
|
||
plug all those leaks.
|
||
Offline repair rebuilds all space metadata after recording the usage of
|
||
the files and directories that it decides not to clear, hence it can build new
|
||
structures in the discovered free space and avoid the question of reaping.
|
||
|
||
As part of a repair, online fsck relies heavily on the reverse mapping records
|
||
to find space that is owned by the corresponding rmap owner yet truly free.
|
||
Cross referencing rmap records with other rmap records is necessary because
|
||
there may be other data structures that also think they own some of those
|
||
blocks (e.g. crosslinked trees).
|
||
Permitting the block allocator to hand them out again will not push the system
|
||
towards consistency.
|
||
|
||
For space metadata, the process of finding extents to dispose of generally
|
||
follows this format:
|
||
|
||
1. Create a bitmap of space used by data structures that must be preserved.
|
||
The space reservations used to create the new metadata can be used here if
|
||
the same rmap owner code is used to denote all of the objects being rebuilt.
|
||
|
||
2. Survey the reverse mapping data to create a bitmap of space owned by the
|
||
same ``XFS_RMAP_OWN_*`` number for the metadata that is being preserved.
|
||
|
||
3. Use the bitmap disunion operator to subtract (1) from (2).
|
||
The remaining set bits represent candidate extents that could be freed.
|
||
The process moves on to step 4 below.
|
||
|
||
Repairs for file-based metadata such as extended attributes, directories,
|
||
symbolic links, quota files and realtime bitmaps are performed by building a
|
||
new structure attached to a temporary file and swapping the forks.
|
||
Afterward, the mappings in the old file fork are the candidate blocks for
|
||
disposal.
|
||
|
||
The process for disposing of old extents is as follows:
|
||
|
||
4. For each candidate extent, count the number of reverse mapping records for
|
||
the first block in that extent that do not have the same rmap owner for the
|
||
data structure being repaired.
|
||
|
||
- If zero, the block has a single owner and can be freed.
|
||
|
||
- If not, the block is part of a crosslinked structure and must not be
|
||
freed.
|
||
|
||
5. Starting with the next block in the extent, figure out how many more blocks
|
||
have the same zero/nonzero other owner status as that first block.
|
||
|
||
6. If the region is crosslinked, delete the reverse mapping entry for the
|
||
structure being repaired and move on to the next region.
|
||
|
||
7. If the region is to be freed, mark any corresponding buffers in the buffer
|
||
cache as stale to prevent log writeback.
|
||
|
||
8. Free the region and move on.
|
||
|
||
However, there is one complication to this procedure.
|
||
Transactions are of finite size, so the reaping process must be careful to roll
|
||
the transactions to avoid overruns.
|
||
Overruns come from two sources:
|
||
|
||
a. EFIs logged on behalf of space that is no longer occupied
|
||
|
||
b. Log items for buffer invalidations
|
||
|
||
This is also a window in which a crash during the reaping process can leak
|
||
blocks.
|
||
As stated earlier, online repair functions use very large transactions to
|
||
minimize the chances of this occurring.
|
||
|
||
The proposed patchset is the
|
||
`preparation for bulk loading btrees
|
||
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-prep-for-bulk-loading>`_
|
||
series.
|
||
|
||
Case Study: Reaping After a Regular Btree Repair
|
||
````````````````````````````````````````````````
|
||
|
||
Old reference count and inode btrees are the easiest to reap because they have
|
||
rmap records with special owner codes: ``XFS_RMAP_OWN_REFC`` for the refcount
|
||
btree, and ``XFS_RMAP_OWN_INOBT`` for the inode and free inode btrees.
|
||
Creating a list of extents to reap the old btree blocks is quite simple,
|
||
conceptually:
|
||
|
||
1. Lock the relevant AGI/AGF header buffers to prevent allocation and frees.
|
||
|
||
2. For each reverse mapping record with an rmap owner corresponding to the
|
||
metadata structure being rebuilt, set the corresponding range in a bitmap.
|
||
|
||
3. Walk the current data structures that have the same rmap owner.
|
||
For each block visited, clear that range in the above bitmap.
|
||
|
||
4. Each set bit in the bitmap represents a block that could be a block from the
|
||
old data structures and hence is a candidate for reaping.
|
||
In other words, ``(rmap_records_owned_by & ~blocks_reachable_by_walk)``
|
||
are the blocks that might be freeable.
|
||
|
||
If it is possible to maintain the AGF lock throughout the repair (which is the
|
||
common case), then step 2 can be performed at the same time as the reverse
|
||
mapping record walk that creates the records for the new btree.
|
||
|
||
Case Study: Rebuilding the Free Space Indices
|
||
`````````````````````````````````````````````
|
||
|
||
The high level process to rebuild the free space indices is:
|
||
|
||
1. Walk the reverse mapping records to generate ``struct xfs_alloc_rec_incore``
|
||
records from the gaps in the reverse mapping btree.
|
||
|
||
2. Append the records to an xfarray.
|
||
|
||
3. Use the ``xfs_btree_bload_compute_geometry`` function to compute the number
|
||
of blocks needed for each new tree.
|
||
|
||
4. Allocate the number of blocks computed in the previous step from the free
|
||
space information collected.
|
||
|
||
5. Use ``xfs_btree_bload`` to write the xfarray records to btree blocks and
|
||
generate the internal node blocks for the free space by length index.
|
||
Call it again for the free space by block number index.
|
||
|
||
6. Commit the locations of the new btree root blocks to the AGF.
|
||
|
||
7. Reap the old btree blocks by looking for space that is not recorded by the
|
||
reverse mapping btree, the new free space btrees, or the AGFL.
|
||
|
||
Repairing the free space btrees has three key complications over a regular
|
||
btree repair:
|
||
|
||
First, free space is not explicitly tracked in the reverse mapping records.
|
||
Hence, the new free space records must be inferred from gaps in the physical
|
||
space component of the keyspace of the reverse mapping btree.
|
||
|
||
Second, free space repairs cannot use the common btree reservation code because
|
||
new blocks are reserved out of the free space btrees.
|
||
This is impossible when repairing the free space btrees themselves.
|
||
However, repair holds the AGF buffer lock for the duration of the free space
|
||
index reconstruction, so it can use the collected free space information to
|
||
supply the blocks for the new free space btrees.
|
||
It is not necessary to back each reserved extent with an EFI because the new
|
||
free space btrees are constructed in what the ondisk filesystem thinks is
|
||
unowned space.
|
||
However, if reserving blocks for the new btrees from the collected free space
|
||
information changes the number of free space records, repair must re-estimate
|
||
the new free space btree geometry with the new record count until the
|
||
reservation is sufficient.
|
||
As part of committing the new btrees, repair must ensure that reverse mappings
|
||
are created for the reserved blocks and that unused reserved blocks are
|
||
inserted into the free space btrees.
|
||
Deferrred rmap and freeing operations are used to ensure that this transition
|
||
is atomic, similar to the other btree repair functions.
|
||
|
||
Third, finding the blocks to reap after the repair is not overly
|
||
straightforward.
|
||
Blocks for the free space btrees and the reverse mapping btrees are supplied by
|
||
the AGFL.
|
||
Blocks put onto the AGFL have reverse mapping records with the owner
|
||
``XFS_RMAP_OWN_AG``.
|
||
This ownership is retained when blocks move from the AGFL into the free space
|
||
btrees or the reverse mapping btrees.
|
||
When repair walks reverse mapping records to synthesize free space records, it
|
||
creates a bitmap (``ag_owner_bitmap``) of all the space claimed by
|
||
``XFS_RMAP_OWN_AG`` records.
|
||
The repair context maintains a second bitmap corresponding to the rmap btree
|
||
blocks and the AGFL blocks (``rmap_agfl_bitmap``).
|
||
When the walk is complete, the bitmap disunion operation ``(ag_owner_bitmap &
|
||
~rmap_agfl_bitmap)`` computes the extents that are used by the old free space
|
||
btrees.
|
||
These blocks can then be reaped using the methods outlined above.
|
||
|
||
The proposed patchset is the
|
||
`AG btree repair
|
||
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-ag-btrees>`_
|
||
series.
|
||
|
||
.. _rmap_reap:
|
||
|
||
Case Study: Reaping After Repairing Reverse Mapping Btrees
|
||
``````````````````````````````````````````````````````````
|
||
|
||
Old reverse mapping btrees are less difficult to reap after a repair.
|
||
As mentioned in the previous section, blocks on the AGFL, the two free space
|
||
btree blocks, and the reverse mapping btree blocks all have reverse mapping
|
||
records with ``XFS_RMAP_OWN_AG`` as the owner.
|
||
The full process of gathering reverse mapping records and building a new btree
|
||
are described in the case study of
|
||
:ref:`live rebuilds of rmap data <rmap_repair>`, but a crucial point from that
|
||
discussion is that the new rmap btree will not contain any records for the old
|
||
rmap btree, nor will the old btree blocks be tracked in the free space btrees.
|
||
The list of candidate reaping blocks is computed by setting the bits
|
||
corresponding to the gaps in the new rmap btree records, and then clearing the
|
||
bits corresponding to extents in the free space btrees and the current AGFL
|
||
blocks.
|
||
The result ``(new_rmapbt_gaps & ~(agfl | bnobt_records))`` are reaped using the
|
||
methods outlined above.
|
||
|
||
The rest of the process of rebuildng the reverse mapping btree is discussed
|
||
in a separate :ref:`case study<rmap_repair>`.
|
||
|
||
The proposed patchset is the
|
||
`AG btree repair
|
||
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-ag-btrees>`_
|
||
series.
|
||
|
||
Case Study: Rebuilding the AGFL
|
||
```````````````````````````````
|
||
|
||
The allocation group free block list (AGFL) is repaired as follows:
|
||
|
||
1. Create a bitmap for all the space that the reverse mapping data claims is
|
||
owned by ``XFS_RMAP_OWN_AG``.
|
||
|
||
2. Subtract the space used by the two free space btrees and the rmap btree.
|
||
|
||
3. Subtract any space that the reverse mapping data claims is owned by any
|
||
other owner, to avoid re-adding crosslinked blocks to the AGFL.
|
||
|
||
4. Once the AGFL is full, reap any blocks leftover.
|
||
|
||
5. The next operation to fix the freelist will right-size the list.
|
||
|
||
See `fs/xfs/scrub/agheader_repair.c <https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/xfs/scrub/agheader_repair.c>`_ for more details.
|
||
|
||
Inode Record Repairs
|
||
--------------------
|
||
|
||
Inode records must be handled carefully, because they have both ondisk records
|
||
("dinodes") and an in-memory ("cached") representation.
|
||
There is a very high potential for cache coherency issues if online fsck is not
|
||
careful to access the ondisk metadata *only* when the ondisk metadata is so
|
||
badly damaged that the filesystem cannot load the in-memory representation.
|
||
When online fsck wants to open a damaged file for scrubbing, it must use
|
||
specialized resource acquisition functions that return either the in-memory
|
||
representation *or* a lock on whichever object is necessary to prevent any
|
||
update to the ondisk location.
|
||
|
||
The only repairs that should be made to the ondisk inode buffers are whatever
|
||
is necessary to get the in-core structure loaded.
|
||
This means fixing whatever is caught by the inode cluster buffer and inode fork
|
||
verifiers, and retrying the ``iget`` operation.
|
||
If the second ``iget`` fails, the repair has failed.
|
||
|
||
Once the in-memory representation is loaded, repair can lock the inode and can
|
||
subject it to comprehensive checks, repairs, and optimizations.
|
||
Most inode attributes are easy to check and constrain, or are user-controlled
|
||
arbitrary bit patterns; these are both easy to fix.
|
||
Dealing with the data and attr fork extent counts and the file block counts is
|
||
more complicated, because computing the correct value requires traversing the
|
||
forks, or if that fails, leaving the fields invalid and waiting for the fork
|
||
fsck functions to run.
|
||
|
||
The proposed patchset is the
|
||
`inode
|
||
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-inodes>`_
|
||
repair series.
|
||
|
||
Quota Record Repairs
|
||
--------------------
|
||
|
||
Similar to inodes, quota records ("dquots") also have both ondisk records and
|
||
an in-memory representation, and hence are subject to the same cache coherency
|
||
issues.
|
||
Somewhat confusingly, both are known as dquots in the XFS codebase.
|
||
|
||
The only repairs that should be made to the ondisk quota record buffers are
|
||
whatever is necessary to get the in-core structure loaded.
|
||
Once the in-memory representation is loaded, the only attributes needing
|
||
checking are obviously bad limits and timer values.
|
||
|
||
Quota usage counters are checked, repaired, and discussed separately in the
|
||
section about :ref:`live quotacheck <quotacheck>`.
|
||
|
||
The proposed patchset is the
|
||
`quota
|
||
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-quota>`_
|
||
repair series.
|
||
|
||
.. _fscounters:
|
||
|
||
Freezing to Fix Summary Counters
|
||
--------------------------------
|
||
|
||
Filesystem summary counters track availability of filesystem resources such
|
||
as free blocks, free inodes, and allocated inodes.
|
||
This information could be compiled by walking the free space and inode indexes,
|
||
but this is a slow process, so XFS maintains a copy in the ondisk superblock
|
||
that should reflect the ondisk metadata, at least when the filesystem has been
|
||
unmounted cleanly.
|
||
For performance reasons, XFS also maintains incore copies of those counters,
|
||
which are key to enabling resource reservations for active transactions.
|
||
Writer threads reserve the worst-case quantities of resources from the
|
||
incore counter and give back whatever they don't use at commit time.
|
||
It is therefore only necessary to serialize on the superblock when the
|
||
superblock is being committed to disk.
|
||
|
||
The lazy superblock counter feature introduced in XFS v5 took this even further
|
||
by training log recovery to recompute the summary counters from the AG headers,
|
||
which eliminated the need for most transactions even to touch the superblock.
|
||
The only time XFS commits the summary counters is at filesystem unmount.
|
||
To reduce contention even further, the incore counter is implemented as a
|
||
percpu counter, which means that each CPU is allocated a batch of blocks from a
|
||
global incore counter and can satisfy small allocations from the local batch.
|
||
|
||
The high-performance nature of the summary counters makes it difficult for
|
||
online fsck to check them, since there is no way to quiesce a percpu counter
|
||
while the system is running.
|
||
Although online fsck can read the filesystem metadata to compute the correct
|
||
values of the summary counters, there's no way to hold the value of a percpu
|
||
counter stable, so it's quite possible that the counter will be out of date by
|
||
the time the walk is complete.
|
||
Earlier versions of online scrub would return to userspace with an incomplete
|
||
scan flag, but this is not a satisfying outcome for a system administrator.
|
||
For repairs, the in-memory counters must be stabilized while walking the
|
||
filesystem metadata to get an accurate reading and install it in the percpu
|
||
counter.
|
||
|
||
To satisfy this requirement, online fsck must prevent other programs in the
|
||
system from initiating new writes to the filesystem, it must disable background
|
||
garbage collection threads, and it must wait for existing writer programs to
|
||
exit the kernel.
|
||
Once that has been established, scrub can walk the AG free space indexes, the
|
||
inode btrees, and the realtime bitmap to compute the correct value of all
|
||
four summary counters.
|
||
This is very similar to a filesystem freeze, though not all of the pieces are
|
||
necessary:
|
||
|
||
- The final freeze state is set one higher than ``SB_FREEZE_COMPLETE`` to
|
||
prevent other threads from thawing the filesystem, or other scrub threads
|
||
from initiating another fscounters freeze.
|
||
|
||
- It does not quiesce the log.
|
||
|
||
With this code in place, it is now possible to pause the filesystem for just
|
||
long enough to check and correct the summary counters.
|
||
|
||
+--------------------------------------------------------------------------+
|
||
| **Historical Sidebar**: |
|
||
+--------------------------------------------------------------------------+
|
||
| The initial implementation used the actual VFS filesystem freeze |
|
||
| mechanism to quiesce filesystem activity. |
|
||
| With the filesystem frozen, it is possible to resolve the counter values |
|
||
| with exact precision, but there are many problems with calling the VFS |
|
||
| methods directly: |
|
||
| |
|
||
| - Other programs can unfreeze the filesystem without our knowledge. |
|
||
| This leads to incorrect scan results and incorrect repairs. |
|
||
| |
|
||
| - Adding an extra lock to prevent others from thawing the filesystem |
|
||
| required the addition of a ``->freeze_super`` function to wrap |
|
||
| ``freeze_fs()``. |
|
||
| This in turn caused other subtle problems because it turns out that |
|
||
| the VFS ``freeze_super`` and ``thaw_super`` functions can drop the |
|
||
| last reference to the VFS superblock, and any subsequent access |
|
||
| becomes a UAF bug! |
|
||
| This can happen if the filesystem is unmounted while the underlying |
|
||
| block device has frozen the filesystem. |
|
||
| This problem could be solved by grabbing extra references to the |
|
||
| superblock, but it felt suboptimal given the other inadequacies of |
|
||
| this approach. |
|
||
| |
|
||
| - The log need not be quiesced to check the summary counters, but a VFS |
|
||
| freeze initiates one anyway. |
|
||
| This adds unnecessary runtime to live fscounter fsck operations. |
|
||
| |
|
||
| - Quiescing the log means that XFS flushes the (possibly incorrect) |
|
||
| counters to disk as part of cleaning the log. |
|
||
| |
|
||
| - A bug in the VFS meant that freeze could complete even when |
|
||
| sync_filesystem fails to flush the filesystem and returns an error. |
|
||
| This bug was fixed in Linux 5.17. |
|
||
+--------------------------------------------------------------------------+
|
||
|
||
The proposed patchset is the
|
||
`summary counter cleanup
|
||
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-fscounters>`_
|
||
series.
|
||
|
||
Full Filesystem Scans
|
||
---------------------
|
||
|
||
Certain types of metadata can only be checked by walking every file in the
|
||
entire filesystem to record observations and comparing the observations against
|
||
what's recorded on disk.
|
||
Like every other type of online repair, repairs are made by writing those
|
||
observations to disk in a replacement structure and committing it atomically.
|
||
However, it is not practical to shut down the entire filesystem to examine
|
||
hundreds of billions of files because the downtime would be excessive.
|
||
Therefore, online fsck must build the infrastructure to manage a live scan of
|
||
all the files in the filesystem.
|
||
There are two questions that need to be solved to perform a live walk:
|
||
|
||
- How does scrub manage the scan while it is collecting data?
|
||
|
||
- How does the scan keep abreast of changes being made to the system by other
|
||
threads?
|
||
|
||
.. _iscan:
|
||
|
||
Coordinated Inode Scans
|
||
```````````````````````
|
||
|
||
In the original Unix filesystems of the 1970s, each directory entry contained
|
||
an index number (*inumber*) which was used as an index into on ondisk array
|
||
(*itable*) of fixed-size records (*inodes*) describing a file's attributes and
|
||
its data block mapping.
|
||
This system is described by J. Lions, `"inode (5659)"
|
||
<http://www.lemis.com/grog/Documentation/Lions/>`_ in *Lions' Commentary on
|
||
UNIX, 6th Edition*, (Dept. of Computer Science, the University of New South
|
||
Wales, November 1977), pp. 18-2; and later by D. Ritchie and K. Thompson,
|
||
`"Implementation of the File System"
|
||
<https://archive.org/details/bstj57-6-1905/page/n8/mode/1up>`_, from *The UNIX
|
||
Time-Sharing System*, (The Bell System Technical Journal, July 1978), pp.
|
||
1913-4.
|
||
|
||
XFS retains most of this design, except now inumbers are search keys over all
|
||
the space in the data section filesystem.
|
||
They form a continuous keyspace that can be expressed as a 64-bit integer,
|
||
though the inodes themselves are sparsely distributed within the keyspace.
|
||
Scans proceed in a linear fashion across the inumber keyspace, starting from
|
||
``0x0`` and ending at ``0xFFFFFFFFFFFFFFFF``.
|
||
Naturally, a scan through a keyspace requires a scan cursor object to track the
|
||
scan progress.
|
||
Because this keyspace is sparse, this cursor contains two parts.
|
||
The first part of this scan cursor object tracks the inode that will be
|
||
examined next; call this the examination cursor.
|
||
Somewhat less obviously, the scan cursor object must also track which parts of
|
||
the keyspace have already been visited, which is critical for deciding if a
|
||
concurrent filesystem update needs to be incorporated into the scan data.
|
||
Call this the visited inode cursor.
|
||
|
||
Advancing the scan cursor is a multi-step process encapsulated in
|
||
``xchk_iscan_iter``:
|
||
|
||
1. Lock the AGI buffer of the AG containing the inode pointed to by the visited
|
||
inode cursor.
|
||
This guarantee that inodes in this AG cannot be allocated or freed while
|
||
advancing the cursor.
|
||
|
||
2. Use the per-AG inode btree to look up the next inumber after the one that
|
||
was just visited, since it may not be keyspace adjacent.
|
||
|
||
3. If there are no more inodes left in this AG:
|
||
|
||
a. Move the examination cursor to the point of the inumber keyspace that
|
||
corresponds to the start of the next AG.
|
||
|
||
b. Adjust the visited inode cursor to indicate that it has "visited" the
|
||
last possible inode in the current AG's inode keyspace.
|
||
XFS inumbers are segmented, so the cursor needs to be marked as having
|
||
visited the entire keyspace up to just before the start of the next AG's
|
||
inode keyspace.
|
||
|
||
c. Unlock the AGI and return to step 1 if there are unexamined AGs in the
|
||
filesystem.
|
||
|
||
d. If there are no more AGs to examine, set both cursors to the end of the
|
||
inumber keyspace.
|
||
The scan is now complete.
|
||
|
||
4. Otherwise, there is at least one more inode to scan in this AG:
|
||
|
||
a. Move the examination cursor ahead to the next inode marked as allocated
|
||
by the inode btree.
|
||
|
||
b. Adjust the visited inode cursor to point to the inode just prior to where
|
||
the examination cursor is now.
|
||
Because the scanner holds the AGI buffer lock, no inodes could have been
|
||
created in the part of the inode keyspace that the visited inode cursor
|
||
just advanced.
|
||
|
||
5. Get the incore inode for the inumber of the examination cursor.
|
||
By maintaining the AGI buffer lock until this point, the scanner knows that
|
||
it was safe to advance the examination cursor across the entire keyspace,
|
||
and that it has stabilized this next inode so that it cannot disappear from
|
||
the filesystem until the scan releases the incore inode.
|
||
|
||
6. Drop the AGI lock and return the incore inode to the caller.
|
||
|
||
Online fsck functions scan all files in the filesystem as follows:
|
||
|
||
1. Start a scan by calling ``xchk_iscan_start``.
|
||
|
||
2. Advance the scan cursor (``xchk_iscan_iter``) to get the next inode.
|
||
If one is provided:
|
||
|
||
a. Lock the inode to prevent updates during the scan.
|
||
|
||
b. Scan the inode.
|
||
|
||
c. While still holding the inode lock, adjust the visited inode cursor
|
||
(``xchk_iscan_mark_visited``) to point to this inode.
|
||
|
||
d. Unlock and release the inode.
|
||
|
||
8. Call ``xchk_iscan_teardown`` to complete the scan.
|
||
|
||
There are subtleties with the inode cache that complicate grabbing the incore
|
||
inode for the caller.
|
||
Obviously, it is an absolute requirement that the inode metadata be consistent
|
||
enough to load it into the inode cache.
|
||
Second, if the incore inode is stuck in some intermediate state, the scan
|
||
coordinator must release the AGI and push the main filesystem to get the inode
|
||
back into a loadable state.
|
||
|
||
The proposed patches are the
|
||
`inode scanner
|
||
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-iscan>`_
|
||
series.
|
||
The first user of the new functionality is the
|
||
`online quotacheck
|
||
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-quotacheck>`_
|
||
series.
|
||
|
||
Inode Management
|
||
````````````````
|
||
|
||
In regular filesystem code, references to allocated XFS incore inodes are
|
||
always obtained (``xfs_iget``) outside of transaction context because the
|
||
creation of the incore context for an existing file does not require metadata
|
||
updates.
|
||
However, it is important to note that references to incore inodes obtained as
|
||
part of file creation must be performed in transaction context because the
|
||
filesystem must ensure the atomicity of the ondisk inode btree index updates
|
||
and the initialization of the actual ondisk inode.
|
||
|
||
References to incore inodes are always released (``xfs_irele``) outside of
|
||
transaction context because there are a handful of activities that might
|
||
require ondisk updates:
|
||
|
||
- The VFS may decide to kick off writeback as part of a ``DONTCACHE`` inode
|
||
release.
|
||
|
||
- Speculative preallocations need to be unreserved.
|
||
|
||
- An unlinked file may have lost its last reference, in which case the entire
|
||
file must be inactivated, which involves releasing all of its resources in
|
||
the ondisk metadata and freeing the inode.
|
||
|
||
These activities are collectively called inode inactivation.
|
||
Inactivation has two parts -- the VFS part, which initiates writeback on all
|
||
dirty file pages, and the XFS part, which cleans up XFS-specific information
|
||
and frees the inode if it was unlinked.
|
||
If the inode is unlinked (or unconnected after a file handle operation), the
|
||
kernel drops the inode into the inactivation machinery immediately.
|
||
|
||
During normal operation, resource acquisition for an update follows this order
|
||
to avoid deadlocks:
|
||
|
||
1. Inode reference (``iget``).
|
||
|
||
2. Filesystem freeze protection, if repairing (``mnt_want_write_file``).
|
||
|
||
3. Inode ``IOLOCK`` (VFS ``i_rwsem``) lock to control file IO.
|
||
|
||
4. Inode ``MMAPLOCK`` (page cache ``invalidate_lock``) lock for operations that
|
||
can update page cache mappings.
|
||
|
||
5. Log feature enablement.
|
||
|
||
6. Transaction log space grant.
|
||
|
||
7. Space on the data and realtime devices for the transaction.
|
||
|
||
8. Incore dquot references, if a file is being repaired.
|
||
Note that they are not locked, merely acquired.
|
||
|
||
9. Inode ``ILOCK`` for file metadata updates.
|
||
|
||
10. AG header buffer locks / Realtime metadata inode ILOCK.
|
||
|
||
11. Realtime metadata buffer locks, if applicable.
|
||
|
||
12. Extent mapping btree blocks, if applicable.
|
||
|
||
Resources are often released in the reverse order, though this is not required.
|
||
However, online fsck differs from regular XFS operations because it may examine
|
||
an object that normally is acquired in a later stage of the locking order, and
|
||
then decide to cross-reference the object with an object that is acquired
|
||
earlier in the order.
|
||
The next few sections detail the specific ways in which online fsck takes care
|
||
to avoid deadlocks.
|
||
|
||
iget and irele During a Scrub
|
||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||
|
||
An inode scan performed on behalf of a scrub operation runs in transaction
|
||
context, and possibly with resources already locked and bound to it.
|
||
This isn't much of a problem for ``iget`` since it can operate in the context
|
||
of an existing transaction, as long as all of the bound resources are acquired
|
||
before the inode reference in the regular filesystem.
|
||
|
||
When the VFS ``iput`` function is given a linked inode with no other
|
||
references, it normally puts the inode on an LRU list in the hope that it can
|
||
save time if another process re-opens the file before the system runs out
|
||
of memory and frees it.
|
||
Filesystem callers can short-circuit the LRU process by setting a ``DONTCACHE``
|
||
flag on the inode to cause the kernel to try to drop the inode into the
|
||
inactivation machinery immediately.
|
||
|
||
In the past, inactivation was always done from the process that dropped the
|
||
inode, which was a problem for scrub because scrub may already hold a
|
||
transaction, and XFS does not support nesting transactions.
|
||
On the other hand, if there is no scrub transaction, it is desirable to drop
|
||
otherwise unused inodes immediately to avoid polluting caches.
|
||
To capture these nuances, the online fsck code has a separate ``xchk_irele``
|
||
function to set or clear the ``DONTCACHE`` flag to get the required release
|
||
behavior.
|
||
|
||
Proposed patchsets include fixing
|
||
`scrub iget usage
|
||
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-iget-fixes>`_ and
|
||
`dir iget usage
|
||
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-dir-iget-fixes>`_.
|
||
|
||
.. _ilocking:
|
||
|
||
Locking Inodes
|
||
^^^^^^^^^^^^^^
|
||
|
||
In regular filesystem code, the VFS and XFS will acquire multiple IOLOCK locks
|
||
in a well-known order: parent → child when updating the directory tree, and
|
||
in numerical order of the addresses of their ``struct inode`` object otherwise.
|
||
For regular files, the MMAPLOCK can be acquired after the IOLOCK to stop page
|
||
faults.
|
||
If two MMAPLOCKs must be acquired, they are acquired in numerical order of
|
||
the addresses of their ``struct address_space`` objects.
|
||
Due to the structure of existing filesystem code, IOLOCKs and MMAPLOCKs must be
|
||
acquired before transactions are allocated.
|
||
If two ILOCKs must be acquired, they are acquired in inumber order.
|
||
|
||
Inode lock acquisition must be done carefully during a coordinated inode scan.
|
||
Online fsck cannot abide these conventions, because for a directory tree
|
||
scanner, the scrub process holds the IOLOCK of the file being scanned and it
|
||
needs to take the IOLOCK of the file at the other end of the directory link.
|
||
If the directory tree is corrupt because it contains a cycle, ``xfs_scrub``
|
||
cannot use the regular inode locking functions and avoid becoming trapped in an
|
||
ABBA deadlock.
|
||
|
||
Solving both of these problems is straightforward -- any time online fsck
|
||
needs to take a second lock of the same class, it uses trylock to avoid an ABBA
|
||
deadlock.
|
||
If the trylock fails, scrub drops all inode locks and use trylock loops to
|
||
(re)acquire all necessary resources.
|
||
Trylock loops enable scrub to check for pending fatal signals, which is how
|
||
scrub avoids deadlocking the filesystem or becoming an unresponsive process.
|
||
However, trylock loops means that online fsck must be prepared to measure the
|
||
resource being scrubbed before and after the lock cycle to detect changes and
|
||
react accordingly.
|
||
|
||
.. _dirparent:
|
||
|
||
Case Study: Finding a Directory Parent
|
||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||
|
||
Consider the directory parent pointer repair code as an example.
|
||
Online fsck must verify that the dotdot dirent of a directory points up to a
|
||
parent directory, and that the parent directory contains exactly one dirent
|
||
pointing down to the child directory.
|
||
Fully validating this relationship (and repairing it if possible) requires a
|
||
walk of every directory on the filesystem while holding the child locked, and
|
||
while updates to the directory tree are being made.
|
||
The coordinated inode scan provides a way to walk the filesystem without the
|
||
possibility of missing an inode.
|
||
The child directory is kept locked to prevent updates to the dotdot dirent, but
|
||
if the scanner fails to lock a parent, it can drop and relock both the child
|
||
and the prospective parent.
|
||
If the dotdot entry changes while the directory is unlocked, then a move or
|
||
rename operation must have changed the child's parentage, and the scan can
|
||
exit early.
|
||
|
||
The proposed patchset is the
|
||
`directory repair
|
||
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-dirs>`_
|
||
series.
|
||
|
||
.. _fshooks:
|
||
|
||
Filesystem Hooks
|
||
`````````````````
|
||
|
||
The second piece of support that online fsck functions need during a full
|
||
filesystem scan is the ability to stay informed about updates being made by
|
||
other threads in the filesystem, since comparisons against the past are useless
|
||
in a dynamic environment.
|
||
Two pieces of Linux kernel infrastructure enable online fsck to monitor regular
|
||
filesystem operations: filesystem hooks and :ref:`static keys<jump_labels>`.
|
||
|
||
Filesystem hooks convey information about an ongoing filesystem operation to
|
||
a downstream consumer.
|
||
In this case, the downstream consumer is always an online fsck function.
|
||
Because multiple fsck functions can run in parallel, online fsck uses the Linux
|
||
notifier call chain facility to dispatch updates to any number of interested
|
||
fsck processes.
|
||
Call chains are a dynamic list, which means that they can be configured at
|
||
run time.
|
||
Because these hooks are private to the XFS module, the information passed along
|
||
contains exactly what the checking function needs to update its observations.
|
||
|
||
The current implementation of XFS hooks uses SRCU notifier chains to reduce the
|
||
impact to highly threaded workloads.
|
||
Regular blocking notifier chains use a rwsem and seem to have a much lower
|
||
overhead for single-threaded applications.
|
||
However, it may turn out that the combination of blocking chains and static
|
||
keys are a more performant combination; more study is needed here.
|
||
|
||
The following pieces are necessary to hook a certain point in the filesystem:
|
||
|
||
- A ``struct xfs_hooks`` object must be embedded in a convenient place such as
|
||
a well-known incore filesystem object.
|
||
|
||
- Each hook must define an action code and a structure containing more context
|
||
about the action.
|
||
|
||
- Hook providers should provide appropriate wrapper functions and structs
|
||
around the ``xfs_hooks`` and ``xfs_hook`` objects to take advantage of type
|
||
checking to ensure correct usage.
|
||
|
||
- A callsite in the regular filesystem code must be chosen to call
|
||
``xfs_hooks_call`` with the action code and data structure.
|
||
This place should be adjacent to (and not earlier than) the place where
|
||
the filesystem update is committed to the transaction.
|
||
In general, when the filesystem calls a hook chain, it should be able to
|
||
handle sleeping and should not be vulnerable to memory reclaim or locking
|
||
recursion.
|
||
However, the exact requirements are very dependent on the context of the hook
|
||
caller and the callee.
|
||
|
||
- The online fsck function should define a structure to hold scan data, a lock
|
||
to coordinate access to the scan data, and a ``struct xfs_hook`` object.
|
||
The scanner function and the regular filesystem code must acquire resources
|
||
in the same order; see the next section for details.
|
||
|
||
- The online fsck code must contain a C function to catch the hook action code
|
||
and data structure.
|
||
If the object being updated has already been visited by the scan, then the
|
||
hook information must be applied to the scan data.
|
||
|
||
- Prior to unlocking inodes to start the scan, online fsck must call
|
||
``xfs_hooks_setup`` to initialize the ``struct xfs_hook``, and
|
||
``xfs_hooks_add`` to enable the hook.
|
||
|
||
- Online fsck must call ``xfs_hooks_del`` to disable the hook once the scan is
|
||
complete.
|
||
|
||
The number of hooks should be kept to a minimum to reduce complexity.
|
||
Static keys are used to reduce the overhead of filesystem hooks to nearly
|
||
zero when online fsck is not running.
|
||
|
||
.. _liveupdate:
|
||
|
||
Live Updates During a Scan
|
||
``````````````````````````
|
||
|
||
The code paths of the online fsck scanning code and the :ref:`hooked<fshooks>`
|
||
filesystem code look like this::
|
||
|
||
other program
|
||
↓
|
||
inode lock ←────────────────────┐
|
||
↓ │
|
||
AG header lock │
|
||
↓ │
|
||
filesystem function │
|
||
↓ │
|
||
notifier call chain │ same
|
||
↓ ├─── inode
|
||
scrub hook function │ lock
|
||
↓ │
|
||
scan data mutex ←──┐ same │
|
||
↓ ├─── scan │
|
||
update scan data │ lock │
|
||
↑ │ │
|
||
scan data mutex ←──┘ │
|
||
↑ │
|
||
inode lock ←────────────────────┘
|
||
↑
|
||
scrub function
|
||
↑
|
||
inode scanner
|
||
↑
|
||
xfs_scrub
|
||
|
||
These rules must be followed to ensure correct interactions between the
|
||
checking code and the code making an update to the filesystem:
|
||
|
||
- Prior to invoking the notifier call chain, the filesystem function being
|
||
hooked must acquire the same lock that the scrub scanning function acquires
|
||
to scan the inode.
|
||
|
||
- The scanning function and the scrub hook function must coordinate access to
|
||
the scan data by acquiring a lock on the scan data.
|
||
|
||
- Scrub hook function must not add the live update information to the scan
|
||
observations unless the inode being updated has already been scanned.
|
||
The scan coordinator has a helper predicate (``xchk_iscan_want_live_update``)
|
||
for this.
|
||
|
||
- Scrub hook functions must not change the caller's state, including the
|
||
transaction that it is running.
|
||
They must not acquire any resources that might conflict with the filesystem
|
||
function being hooked.
|
||
|
||
- The hook function can abort the inode scan to avoid breaking the other rules.
|
||
|
||
The inode scan APIs are pretty simple:
|
||
|
||
- ``xchk_iscan_start`` starts a scan
|
||
|
||
- ``xchk_iscan_iter`` grabs a reference to the next inode in the scan or
|
||
returns zero if there is nothing left to scan
|
||
|
||
- ``xchk_iscan_want_live_update`` to decide if an inode has already been
|
||
visited in the scan.
|
||
This is critical for hook functions to decide if they need to update the
|
||
in-memory scan information.
|
||
|
||
- ``xchk_iscan_mark_visited`` to mark an inode as having been visited in the
|
||
scan
|
||
|
||
- ``xchk_iscan_teardown`` to finish the scan
|
||
|
||
This functionality is also a part of the
|
||
`inode scanner
|
||
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-iscan>`_
|
||
series.
|
||
|
||
.. _quotacheck:
|
||
|
||
Case Study: Quota Counter Checking
|
||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||
|
||
It is useful to compare the mount time quotacheck code to the online repair
|
||
quotacheck code.
|
||
Mount time quotacheck does not have to contend with concurrent operations, so
|
||
it does the following:
|
||
|
||
1. Make sure the ondisk dquots are in good enough shape that all the incore
|
||
dquots will actually load, and zero the resource usage counters in the
|
||
ondisk buffer.
|
||
|
||
2. Walk every inode in the filesystem.
|
||
Add each file's resource usage to the incore dquot.
|
||
|
||
3. Walk each incore dquot.
|
||
If the incore dquot is not being flushed, add the ondisk buffer backing the
|
||
incore dquot to a delayed write (delwri) list.
|
||
|
||
4. Write the buffer list to disk.
|
||
|
||
Like most online fsck functions, online quotacheck can't write to regular
|
||
filesystem objects until the newly collected metadata reflect all filesystem
|
||
state.
|
||
Therefore, online quotacheck records file resource usage to a shadow dquot
|
||
index implemented with a sparse ``xfarray``, and only writes to the real dquots
|
||
once the scan is complete.
|
||
Handling transactional updates is tricky because quota resource usage updates
|
||
are handled in phases to minimize contention on dquots:
|
||
|
||
1. The inodes involved are joined and locked to a transaction.
|
||
|
||
2. For each dquot attached to the file:
|
||
|
||
a. The dquot is locked.
|
||
|
||
b. A quota reservation is added to the dquot's resource usage.
|
||
The reservation is recorded in the transaction.
|
||
|
||
c. The dquot is unlocked.
|
||
|
||
3. Changes in actual quota usage are tracked in the transaction.
|
||
|
||
4. At transaction commit time, each dquot is examined again:
|
||
|
||
a. The dquot is locked again.
|
||
|
||
b. Quota usage changes are logged and unused reservation is given back to
|
||
the dquot.
|
||
|
||
c. The dquot is unlocked.
|
||
|
||
For online quotacheck, hooks are placed in steps 2 and 4.
|
||
The step 2 hook creates a shadow version of the transaction dquot context
|
||
(``dqtrx``) that operates in a similar manner to the regular code.
|
||
The step 4 hook commits the shadow ``dqtrx`` changes to the shadow dquots.
|
||
Notice that both hooks are called with the inode locked, which is how the
|
||
live update coordinates with the inode scanner.
|
||
|
||
The quotacheck scan looks like this:
|
||
|
||
1. Set up a coordinated inode scan.
|
||
|
||
2. For each inode returned by the inode scan iterator:
|
||
|
||
a. Grab and lock the inode.
|
||
|
||
b. Determine that inode's resource usage (data blocks, inode counts,
|
||
realtime blocks) and add that to the shadow dquots for the user, group,
|
||
and project ids associated with the inode.
|
||
|
||
c. Unlock and release the inode.
|
||
|
||
3. For each dquot in the system:
|
||
|
||
a. Grab and lock the dquot.
|
||
|
||
b. Check the dquot against the shadow dquots created by the scan and updated
|
||
by the live hooks.
|
||
|
||
Live updates are key to being able to walk every quota record without
|
||
needing to hold any locks for a long duration.
|
||
If repairs are desired, the real and shadow dquots are locked and their
|
||
resource counts are set to the values in the shadow dquot.
|
||
|
||
The proposed patchset is the
|
||
`online quotacheck
|
||
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-quotacheck>`_
|
||
series.
|
||
|
||
.. _nlinks:
|
||
|
||
Case Study: File Link Count Checking
|
||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||
|
||
File link count checking also uses live update hooks.
|
||
The coordinated inode scanner is used to visit all directories on the
|
||
filesystem, and per-file link count records are stored in a sparse ``xfarray``
|
||
indexed by inumber.
|
||
During the scanning phase, each entry in a directory generates observation
|
||
data as follows:
|
||
|
||
1. If the entry is a dotdot (``'..'``) entry of the root directory, the
|
||
directory's parent link count is bumped because the root directory's dotdot
|
||
entry is self referential.
|
||
|
||
2. If the entry is a dotdot entry of a subdirectory, the parent's backref
|
||
count is bumped.
|
||
|
||
3. If the entry is neither a dot nor a dotdot entry, the target file's parent
|
||
count is bumped.
|
||
|
||
4. If the target is a subdirectory, the parent's child link count is bumped.
|
||
|
||
A crucial point to understand about how the link count inode scanner interacts
|
||
with the live update hooks is that the scan cursor tracks which *parent*
|
||
directories have been scanned.
|
||
In other words, the live updates ignore any update about ``A → B`` when A has
|
||
not been scanned, even if B has been scanned.
|
||
Furthermore, a subdirectory A with a dotdot entry pointing back to B is
|
||
accounted as a backref counter in the shadow data for A, since child dotdot
|
||
entries affect the parent's link count.
|
||
Live update hooks are carefully placed in all parts of the filesystem that
|
||
create, change, or remove directory entries, since those operations involve
|
||
bumplink and droplink.
|
||
|
||
For any file, the correct link count is the number of parents plus the number
|
||
of child subdirectories.
|
||
Non-directories never have children of any kind.
|
||
The backref information is used to detect inconsistencies in the number of
|
||
links pointing to child subdirectories and the number of dotdot entries
|
||
pointing back.
|
||
|
||
After the scan completes, the link count of each file can be checked by locking
|
||
both the inode and the shadow data, and comparing the link counts.
|
||
A second coordinated inode scan cursor is used for comparisons.
|
||
Live updates are key to being able to walk every inode without needing to hold
|
||
any locks between inodes.
|
||
If repairs are desired, the inode's link count is set to the value in the
|
||
shadow information.
|
||
If no parents are found, the file must be :ref:`reparented <orphanage>` to the
|
||
orphanage to prevent the file from being lost forever.
|
||
|
||
The proposed patchset is the
|
||
`file link count repair
|
||
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-nlinks>`_
|
||
series.
|
||
|
||
.. _rmap_repair:
|
||
|
||
Case Study: Rebuilding Reverse Mapping Records
|
||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||
|
||
Most repair functions follow the same pattern: lock filesystem resources,
|
||
walk the surviving ondisk metadata looking for replacement metadata records,
|
||
and use an :ref:`in-memory array <xfarray>` to store the gathered observations.
|
||
The primary advantage of this approach is the simplicity and modularity of the
|
||
repair code -- code and data are entirely contained within the scrub module,
|
||
do not require hooks in the main filesystem, and are usually the most efficient
|
||
in memory use.
|
||
A secondary advantage of this repair approach is atomicity -- once the kernel
|
||
decides a structure is corrupt, no other threads can access the metadata until
|
||
the kernel finishes repairing and revalidating the metadata.
|
||
|
||
For repairs going on within a shard of the filesystem, these advantages
|
||
outweigh the delays inherent in locking the shard while repairing parts of the
|
||
shard.
|
||
Unfortunately, repairs to the reverse mapping btree cannot use the "standard"
|
||
btree repair strategy because it must scan every space mapping of every fork of
|
||
every file in the filesystem, and the filesystem cannot stop.
|
||
Therefore, rmap repair foregoes atomicity between scrub and repair.
|
||
It combines a :ref:`coordinated inode scanner <iscan>`, :ref:`live update hooks
|
||
<liveupdate>`, and an :ref:`in-memory rmap btree <xfbtree>` to complete the
|
||
scan for reverse mapping records.
|
||
|
||
1. Set up an xfbtree to stage rmap records.
|
||
|
||
2. While holding the locks on the AGI and AGF buffers acquired during the
|
||
scrub, generate reverse mappings for all AG metadata: inodes, btrees, CoW
|
||
staging extents, and the internal log.
|
||
|
||
3. Set up an inode scanner.
|
||
|
||
4. Hook into rmap updates for the AG being repaired so that the live scan data
|
||
can receive updates to the rmap btree from the rest of the filesystem during
|
||
the file scan.
|
||
|
||
5. For each space mapping found in either fork of each file scanned,
|
||
decide if the mapping matches the AG of interest.
|
||
If so:
|
||
|
||
a. Create a btree cursor for the in-memory btree.
|
||
|
||
b. Use the rmap code to add the record to the in-memory btree.
|
||
|
||
c. Use the :ref:`special commit function <xfbtree_commit>` to write the
|
||
xfbtree changes to the xfile.
|
||
|
||
6. For each live update received via the hook, decide if the owner has already
|
||
been scanned.
|
||
If so, apply the live update into the scan data:
|
||
|
||
a. Create a btree cursor for the in-memory btree.
|
||
|
||
b. Replay the operation into the in-memory btree.
|
||
|
||
c. Use the :ref:`special commit function <xfbtree_commit>` to write the
|
||
xfbtree changes to the xfile.
|
||
This is performed with an empty transaction to avoid changing the
|
||
caller's state.
|
||
|
||
7. When the inode scan finishes, create a new scrub transaction and relock the
|
||
two AG headers.
|
||
|
||
8. Compute the new btree geometry using the number of rmap records in the
|
||
shadow btree, like all other btree rebuilding functions.
|
||
|
||
9. Allocate the number of blocks computed in the previous step.
|
||
|
||
10. Perform the usual btree bulk loading and commit to install the new rmap
|
||
btree.
|
||
|
||
11. Reap the old rmap btree blocks as discussed in the case study about how
|
||
to :ref:`reap after rmap btree repair <rmap_reap>`.
|
||
|
||
12. Free the xfbtree now that it not needed.
|
||
|
||
The proposed patchset is the
|
||
`rmap repair
|
||
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-rmap-btree>`_
|
||
series.
|
||
|
||
Staging Repairs with Temporary Files on Disk
|
||
--------------------------------------------
|
||
|
||
XFS stores a substantial amount of metadata in file forks: directories,
|
||
extended attributes, symbolic link targets, free space bitmaps and summary
|
||
information for the realtime volume, and quota records.
|
||
File forks map 64-bit logical file fork space extents to physical storage space
|
||
extents, similar to how a memory management unit maps 64-bit virtual addresses
|
||
to physical memory addresses.
|
||
Therefore, file-based tree structures (such as directories and extended
|
||
attributes) use blocks mapped in the file fork offset address space that point
|
||
to other blocks mapped within that same address space, and file-based linear
|
||
structures (such as bitmaps and quota records) compute array element offsets in
|
||
the file fork offset address space.
|
||
|
||
Because file forks can consume as much space as the entire filesystem, repairs
|
||
cannot be staged in memory, even when a paging scheme is available.
|
||
Therefore, online repair of file-based metadata createas a temporary file in
|
||
the XFS filesystem, writes a new structure at the correct offsets into the
|
||
temporary file, and atomically swaps the fork mappings (and hence the fork
|
||
contents) to commit the repair.
|
||
Once the repair is complete, the old fork can be reaped as necessary; if the
|
||
system goes down during the reap, the iunlink code will delete the blocks
|
||
during log recovery.
|
||
|
||
**Note**: All space usage and inode indices in the filesystem *must* be
|
||
consistent to use a temporary file safely!
|
||
This dependency is the reason why online repair can only use pageable kernel
|
||
memory to stage ondisk space usage information.
|
||
|
||
Swapping metadata extents with a temporary file requires the owner field of the
|
||
block headers to match the file being repaired and not the temporary file. The
|
||
directory, extended attribute, and symbolic link functions were all modified to
|
||
allow callers to specify owner numbers explicitly.
|
||
|
||
There is a downside to the reaping process -- if the system crashes during the
|
||
reap phase and the fork extents are crosslinked, the iunlink processing will
|
||
fail because freeing space will find the extra reverse mappings and abort.
|
||
|
||
Temporary files created for repair are similar to ``O_TMPFILE`` files created
|
||
by userspace.
|
||
They are not linked into a directory and the entire file will be reaped when
|
||
the last reference to the file is lost.
|
||
The key differences are that these files must have no access permission outside
|
||
the kernel at all, they must be specially marked to prevent them from being
|
||
opened by handle, and they must never be linked into the directory tree.
|
||
|
||
+--------------------------------------------------------------------------+
|
||
| **Historical Sidebar**: |
|
||
+--------------------------------------------------------------------------+
|
||
| In the initial iteration of file metadata repair, the damaged metadata |
|
||
| blocks would be scanned for salvageable data; the extents in the file |
|
||
| fork would be reaped; and then a new structure would be built in its |
|
||
| place. |
|
||
| This strategy did not survive the introduction of the atomic repair |
|
||
| requirement expressed earlier in this document. |
|
||
| |
|
||
| The second iteration explored building a second structure at a high |
|
||
| offset in the fork from the salvage data, reaping the old extents, and |
|
||
| using a ``COLLAPSE_RANGE`` operation to slide the new extents into |
|
||
| place. |
|
||
| |
|
||
| This had many drawbacks: |
|
||
| |
|
||
| - Array structures are linearly addressed, and the regular filesystem |
|
||
| codebase does not have the concept of a linear offset that could be |
|
||
| applied to the record offset computation to build an alternate copy. |
|
||
| |
|
||
| - Extended attributes are allowed to use the entire attr fork offset |
|
||
| address space. |
|
||
| |
|
||
| - Even if repair could build an alternate copy of a data structure in a |
|
||
| different part of the fork address space, the atomic repair commit |
|
||
| requirement means that online repair would have to be able to perform |
|
||
| a log assisted ``COLLAPSE_RANGE`` operation to ensure that the old |
|
||
| structure was completely replaced. |
|
||
| |
|
||
| - A crash after construction of the secondary tree but before the range |
|
||
| collapse would leave unreachable blocks in the file fork. |
|
||
| This would likely confuse things further. |
|
||
| |
|
||
| - Reaping blocks after a repair is not a simple operation, and |
|
||
| initiating a reap operation from a restarted range collapse operation |
|
||
| during log recovery is daunting. |
|
||
| |
|
||
| - Directory entry blocks and quota records record the file fork offset |
|
||
| in the header area of each block. |
|
||
| An atomic range collapse operation would have to rewrite this part of |
|
||
| each block header. |
|
||
| Rewriting a single field in block headers is not a huge problem, but |
|
||
| it's something to be aware of. |
|
||
| |
|
||
| - Each block in a directory or extended attributes btree index contains |
|
||
| sibling and child block pointers. |
|
||
| Were the atomic commit to use a range collapse operation, each block |
|
||
| would have to be rewritten very carefully to preserve the graph |
|
||
| structure. |
|
||
| Doing this as part of a range collapse means rewriting a large number |
|
||
| of blocks repeatedly, which is not conducive to quick repairs. |
|
||
| |
|
||
| This lead to the introduction of temporary file staging. |
|
||
+--------------------------------------------------------------------------+
|
||
|
||
Using a Temporary File
|
||
``````````````````````
|
||
|
||
Online repair code should use the ``xrep_tempfile_create`` function to create a
|
||
temporary file inside the filesystem.
|
||
This allocates an inode, marks the in-core inode private, and attaches it to
|
||
the scrub context.
|
||
These files are hidden from userspace, may not be added to the directory tree,
|
||
and must be kept private.
|
||
|
||
Temporary files only use two inode locks: the IOLOCK and the ILOCK.
|
||
The MMAPLOCK is not needed here, because there must not be page faults from
|
||
userspace for data fork blocks.
|
||
The usage patterns of these two locks are the same as for any other XFS file --
|
||
access to file data are controlled via the IOLOCK, and access to file metadata
|
||
are controlled via the ILOCK.
|
||
Locking helpers are provided so that the temporary file and its lock state can
|
||
be cleaned up by the scrub context.
|
||
To comply with the nested locking strategy laid out in the :ref:`inode
|
||
locking<ilocking>` section, it is recommended that scrub functions use the
|
||
xrep_tempfile_ilock*_nowait lock helpers.
|
||
|
||
Data can be written to a temporary file by two means:
|
||
|
||
1. ``xrep_tempfile_copyin`` can be used to set the contents of a regular
|
||
temporary file from an xfile.
|
||
|
||
2. The regular directory, symbolic link, and extended attribute functions can
|
||
be used to write to the temporary file.
|
||
|
||
Once a good copy of a data file has been constructed in a temporary file, it
|
||
must be conveyed to the file being repaired, which is the topic of the next
|
||
section.
|
||
|
||
The proposed patches are in the
|
||
`repair temporary files
|
||
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-tempfiles>`_
|
||
series.
|
||
|
||
Atomic Extent Swapping
|
||
----------------------
|
||
|
||
Once repair builds a temporary file with a new data structure written into
|
||
it, it must commit the new changes into the existing file.
|
||
It is not possible to swap the inumbers of two files, so instead the new
|
||
metadata must replace the old.
|
||
This suggests the need for the ability to swap extents, but the existing extent
|
||
swapping code used by the file defragmenting tool ``xfs_fsr`` is not sufficient
|
||
for online repair because:
|
||
|
||
a. When the reverse-mapping btree is enabled, the swap code must keep the
|
||
reverse mapping information up to date with every exchange of mappings.
|
||
Therefore, it can only exchange one mapping per transaction, and each
|
||
transaction is independent.
|
||
|
||
b. Reverse-mapping is critical for the operation of online fsck, so the old
|
||
defragmentation code (which swapped entire extent forks in a single
|
||
operation) is not useful here.
|
||
|
||
c. Defragmentation is assumed to occur between two files with identical
|
||
contents.
|
||
For this use case, an incomplete exchange will not result in a user-visible
|
||
change in file contents, even if the operation is interrupted.
|
||
|
||
d. Online repair needs to swap the contents of two files that are by definition
|
||
*not* identical.
|
||
For directory and xattr repairs, the user-visible contents might be the
|
||
same, but the contents of individual blocks may be very different.
|
||
|
||
e. Old blocks in the file may be cross-linked with another structure and must
|
||
not reappear if the system goes down mid-repair.
|
||
|
||
These problems are overcome by creating a new deferred operation and a new type
|
||
of log intent item to track the progress of an operation to exchange two file
|
||
ranges.
|
||
The new deferred operation type chains together the same transactions used by
|
||
the reverse-mapping extent swap code.
|
||
The new log item records the progress of the exchange to ensure that once an
|
||
exchange begins, it will always run to completion, even there are
|
||
interruptions.
|
||
The new ``XFS_SB_FEAT_INCOMPAT_LOG_ATOMIC_SWAP`` log-incompatible feature flag
|
||
in the superblock protects these new log item records from being replayed on
|
||
old kernels.
|
||
|
||
The proposed patchset is the
|
||
`atomic extent swap
|
||
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=atomic-file-updates>`_
|
||
series.
|
||
|
||
+--------------------------------------------------------------------------+
|
||
| **Sidebar: Using Log-Incompatible Feature Flags** |
|
||
+--------------------------------------------------------------------------+
|
||
| Starting with XFS v5, the superblock contains a |
|
||
| ``sb_features_log_incompat`` field to indicate that the log contains |
|
||
| records that might not readable by all kernels that could mount this |
|
||
| filesystem. |
|
||
| In short, log incompat features protect the log contents against kernels |
|
||
| that will not understand the contents. |
|
||
| Unlike the other superblock feature bits, log incompat bits are |
|
||
| ephemeral because an empty (clean) log does not need protection. |
|
||
| The log cleans itself after its contents have been committed into the |
|
||
| filesystem, either as part of an unmount or because the system is |
|
||
| otherwise idle. |
|
||
| Because upper level code can be working on a transaction at the same |
|
||
| time that the log cleans itself, it is necessary for upper level code to |
|
||
| communicate to the log when it is going to use a log incompatible |
|
||
| feature. |
|
||
| |
|
||
| The log coordinates access to incompatible features through the use of |
|
||
| one ``struct rw_semaphore`` for each feature. |
|
||
| The log cleaning code tries to take this rwsem in exclusive mode to |
|
||
| clear the bit; if the lock attempt fails, the feature bit remains set. |
|
||
| Filesystem code signals its intention to use a log incompat feature in a |
|
||
| transaction by calling ``xlog_use_incompat_feat``, which takes the rwsem |
|
||
| in shared mode. |
|
||
| The code supporting a log incompat feature should create wrapper |
|
||
| functions to obtain the log feature and call |
|
||
| ``xfs_add_incompat_log_feature`` to set the feature bits in the primary |
|
||
| superblock. |
|
||
| The superblock update is performed transactionally, so the wrapper to |
|
||
| obtain log assistance must be called just prior to the creation of the |
|
||
| transaction that uses the functionality. |
|
||
| For a file operation, this step must happen after taking the IOLOCK |
|
||
| and the MMAPLOCK, but before allocating the transaction. |
|
||
| When the transaction is complete, the ``xlog_drop_incompat_feat`` |
|
||
| function is called to release the feature. |
|
||
| The feature bit will not be cleared from the superblock until the log |
|
||
| becomes clean. |
|
||
| |
|
||
| Log-assisted extended attribute updates and atomic extent swaps both use |
|
||
| log incompat features and provide convenience wrappers around the |
|
||
| functionality. |
|
||
+--------------------------------------------------------------------------+
|
||
|
||
Mechanics of an Atomic Extent Swap
|
||
``````````````````````````````````
|
||
|
||
Swapping entire file forks is a complex task.
|
||
The goal is to exchange all file fork mappings between two file fork offset
|
||
ranges.
|
||
There are likely to be many extent mappings in each fork, and the edges of
|
||
the mappings aren't necessarily aligned.
|
||
Furthermore, there may be other updates that need to happen after the swap,
|
||
such as exchanging file sizes, inode flags, or conversion of fork data to local
|
||
format.
|
||
This is roughly the format of the new deferred extent swap work item:
|
||
|
||
.. code-block:: c
|
||
|
||
struct xfs_swapext_intent {
|
||
/* Inodes participating in the operation. */
|
||
struct xfs_inode *sxi_ip1;
|
||
struct xfs_inode *sxi_ip2;
|
||
|
||
/* File offset range information. */
|
||
xfs_fileoff_t sxi_startoff1;
|
||
xfs_fileoff_t sxi_startoff2;
|
||
xfs_filblks_t sxi_blockcount;
|
||
|
||
/* Set these file sizes after the operation, unless negative. */
|
||
xfs_fsize_t sxi_isize1;
|
||
xfs_fsize_t sxi_isize2;
|
||
|
||
/* XFS_SWAP_EXT_* log operation flags */
|
||
uint64_t sxi_flags;
|
||
};
|
||
|
||
The new log intent item contains enough information to track two logical fork
|
||
offset ranges: ``(inode1, startoff1, blockcount)`` and ``(inode2, startoff2,
|
||
blockcount)``.
|
||
Each step of a swap operation exchanges the largest file range mapping possible
|
||
from one file to the other.
|
||
After each step in the swap operation, the two startoff fields are incremented
|
||
and the blockcount field is decremented to reflect the progress made.
|
||
The flags field captures behavioral parameters such as swapping the attr fork
|
||
instead of the data fork and other work to be done after the extent swap.
|
||
The two isize fields are used to swap the file size at the end of the operation
|
||
if the file data fork is the target of the swap operation.
|
||
|
||
When the extent swap is initiated, the sequence of operations is as follows:
|
||
|
||
1. Create a deferred work item for the extent swap.
|
||
At the start, it should contain the entirety of the file ranges to be
|
||
swapped.
|
||
|
||
2. Call ``xfs_defer_finish`` to process the exchange.
|
||
This is encapsulated in ``xrep_tempswap_contents`` for scrub operations.
|
||
This will log an extent swap intent item to the transaction for the deferred
|
||
extent swap work item.
|
||
|
||
3. Until ``sxi_blockcount`` of the deferred extent swap work item is zero,
|
||
|
||
a. Read the block maps of both file ranges starting at ``sxi_startoff1`` and
|
||
``sxi_startoff2``, respectively, and compute the longest extent that can
|
||
be swapped in a single step.
|
||
This is the minimum of the two ``br_blockcount`` s in the mappings.
|
||
Keep advancing through the file forks until at least one of the mappings
|
||
contains written blocks.
|
||
Mutual holes, unwritten extents, and extent mappings to the same physical
|
||
space are not exchanged.
|
||
|
||
For the next few steps, this document will refer to the mapping that came
|
||
from file 1 as "map1", and the mapping that came from file 2 as "map2".
|
||
|
||
b. Create a deferred block mapping update to unmap map1 from file 1.
|
||
|
||
c. Create a deferred block mapping update to unmap map2 from file 2.
|
||
|
||
d. Create a deferred block mapping update to map map1 into file 2.
|
||
|
||
e. Create a deferred block mapping update to map map2 into file 1.
|
||
|
||
f. Log the block, quota, and extent count updates for both files.
|
||
|
||
g. Extend the ondisk size of either file if necessary.
|
||
|
||
h. Log an extent swap done log item for the extent swap intent log item
|
||
that was read at the start of step 3.
|
||
|
||
i. Compute the amount of file range that has just been covered.
|
||
This quantity is ``(map1.br_startoff + map1.br_blockcount -
|
||
sxi_startoff1)``, because step 3a could have skipped holes.
|
||
|
||
j. Increase the starting offsets of ``sxi_startoff1`` and ``sxi_startoff2``
|
||
by the number of blocks computed in the previous step, and decrease
|
||
``sxi_blockcount`` by the same quantity.
|
||
This advances the cursor.
|
||
|
||
k. Log a new extent swap intent log item reflecting the advanced state of
|
||
the work item.
|
||
|
||
l. Return the proper error code (EAGAIN) to the deferred operation manager
|
||
to inform it that there is more work to be done.
|
||
The operation manager completes the deferred work in steps 3b-3e before
|
||
moving back to the start of step 3.
|
||
|
||
4. Perform any post-processing.
|
||
This will be discussed in more detail in subsequent sections.
|
||
|
||
If the filesystem goes down in the middle of an operation, log recovery will
|
||
find the most recent unfinished extent swap log intent item and restart from
|
||
there.
|
||
This is how extent swapping guarantees that an outside observer will either see
|
||
the old broken structure or the new one, and never a mismash of both.
|
||
|
||
Preparation for Extent Swapping
|
||
```````````````````````````````
|
||
|
||
There are a few things that need to be taken care of before initiating an
|
||
atomic extent swap operation.
|
||
First, regular files require the page cache to be flushed to disk before the
|
||
operation begins, and directio writes to be quiesced.
|
||
Like any filesystem operation, extent swapping must determine the maximum
|
||
amount of disk space and quota that can be consumed on behalf of both files in
|
||
the operation, and reserve that quantity of resources to avoid an unrecoverable
|
||
out of space failure once it starts dirtying metadata.
|
||
The preparation step scans the ranges of both files to estimate:
|
||
|
||
- Data device blocks needed to handle the repeated updates to the fork
|
||
mappings.
|
||
- Change in data and realtime block counts for both files.
|
||
- Increase in quota usage for both files, if the two files do not share the
|
||
same set of quota ids.
|
||
- The number of extent mappings that will be added to each file.
|
||
- Whether or not there are partially written realtime extents.
|
||
User programs must never be able to access a realtime file extent that maps
|
||
to different extents on the realtime volume, which could happen if the
|
||
operation fails to run to completion.
|
||
|
||
The need for precise estimation increases the run time of the swap operation,
|
||
but it is very important to maintain correct accounting.
|
||
The filesystem must not run completely out of free space, nor can the extent
|
||
swap ever add more extent mappings to a fork than it can support.
|
||
Regular users are required to abide the quota limits, though metadata repairs
|
||
may exceed quota to resolve inconsistent metadata elsewhere.
|
||
|
||
Special Features for Swapping Metadata File Extents
|
||
```````````````````````````````````````````````````
|
||
|
||
Extended attributes, symbolic links, and directories can set the fork format to
|
||
"local" and treat the fork as a literal area for data storage.
|
||
Metadata repairs must take extra steps to support these cases:
|
||
|
||
- If both forks are in local format and the fork areas are large enough, the
|
||
swap is performed by copying the incore fork contents, logging both forks,
|
||
and committing.
|
||
The atomic extent swap mechanism is not necessary, since this can be done
|
||
with a single transaction.
|
||
|
||
- If both forks map blocks, then the regular atomic extent swap is used.
|
||
|
||
- Otherwise, only one fork is in local format.
|
||
The contents of the local format fork are converted to a block to perform the
|
||
swap.
|
||
The conversion to block format must be done in the same transaction that
|
||
logs the initial extent swap intent log item.
|
||
The regular atomic extent swap is used to exchange the mappings.
|
||
Special flags are set on the swap operation so that the transaction can be
|
||
rolled one more time to convert the second file's fork back to local format
|
||
so that the second file will be ready to go as soon as the ILOCK is dropped.
|
||
|
||
Extended attributes and directories stamp the owning inode into every block,
|
||
but the buffer verifiers do not actually check the inode number!
|
||
Although there is no verification, it is still important to maintain
|
||
referential integrity, so prior to performing the extent swap, online repair
|
||
builds every block in the new data structure with the owner field of the file
|
||
being repaired.
|
||
|
||
After a successful swap operation, the repair operation must reap the old fork
|
||
blocks by processing each fork mapping through the standard :ref:`file extent
|
||
reaping <reaping>` mechanism that is done post-repair.
|
||
If the filesystem should go down during the reap part of the repair, the
|
||
iunlink processing at the end of recovery will free both the temporary file and
|
||
whatever blocks were not reaped.
|
||
However, this iunlink processing omits the cross-link detection of online
|
||
repair, and is not completely foolproof.
|
||
|
||
Swapping Temporary File Extents
|
||
```````````````````````````````
|
||
|
||
To repair a metadata file, online repair proceeds as follows:
|
||
|
||
1. Create a temporary repair file.
|
||
|
||
2. Use the staging data to write out new contents into the temporary repair
|
||
file.
|
||
The same fork must be written to as is being repaired.
|
||
|
||
3. Commit the scrub transaction, since the swap estimation step must be
|
||
completed before transaction reservations are made.
|
||
|
||
4. Call ``xrep_tempswap_trans_alloc`` to allocate a new scrub transaction with
|
||
the appropriate resource reservations, locks, and fill out a ``struct
|
||
xfs_swapext_req`` with the details of the swap operation.
|
||
|
||
5. Call ``xrep_tempswap_contents`` to swap the contents.
|
||
|
||
6. Commit the transaction to complete the repair.
|
||
|
||
.. _rtsummary:
|
||
|
||
Case Study: Repairing the Realtime Summary File
|
||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||
|
||
In the "realtime" section of an XFS filesystem, free space is tracked via a
|
||
bitmap, similar to Unix FFS.
|
||
Each bit in the bitmap represents one realtime extent, which is a multiple of
|
||
the filesystem block size between 4KiB and 1GiB in size.
|
||
The realtime summary file indexes the number of free extents of a given size to
|
||
the offset of the block within the realtime free space bitmap where those free
|
||
extents begin.
|
||
In other words, the summary file helps the allocator find free extents by
|
||
length, similar to what the free space by count (cntbt) btree does for the data
|
||
section.
|
||
|
||
The summary file itself is a flat file (with no block headers or checksums!)
|
||
partitioned into ``log2(total rt extents)`` sections containing enough 32-bit
|
||
counters to match the number of blocks in the rt bitmap.
|
||
Each counter records the number of free extents that start in that bitmap block
|
||
and can satisfy a power-of-two allocation request.
|
||
|
||
To check the summary file against the bitmap:
|
||
|
||
1. Take the ILOCK of both the realtime bitmap and summary files.
|
||
|
||
2. For each free space extent recorded in the bitmap:
|
||
|
||
a. Compute the position in the summary file that contains a counter that
|
||
represents this free extent.
|
||
|
||
b. Read the counter from the xfile.
|
||
|
||
c. Increment it, and write it back to the xfile.
|
||
|
||
3. Compare the contents of the xfile against the ondisk file.
|
||
|
||
To repair the summary file, write the xfile contents into the temporary file
|
||
and use atomic extent swap to commit the new contents.
|
||
The temporary file is then reaped.
|
||
|
||
The proposed patchset is the
|
||
`realtime summary repair
|
||
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-rtsummary>`_
|
||
series.
|
||
|
||
Case Study: Salvaging Extended Attributes
|
||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||
|
||
In XFS, extended attributes are implemented as a namespaced name-value store.
|
||
Values are limited in size to 64KiB, but there is no limit in the number of
|
||
names.
|
||
The attribute fork is unpartitioned, which means that the root of the attribute
|
||
structure is always in logical block zero, but attribute leaf blocks, dabtree
|
||
index blocks, and remote value blocks are intermixed.
|
||
Attribute leaf blocks contain variable-sized records that associate
|
||
user-provided names with the user-provided values.
|
||
Values larger than a block are allocated separate extents and written there.
|
||
If the leaf information expands beyond a single block, a directory/attribute
|
||
btree (``dabtree``) is created to map hashes of attribute names to entries
|
||
for fast lookup.
|
||
|
||
Salvaging extended attributes is done as follows:
|
||
|
||
1. Walk the attr fork mappings of the file being repaired to find the attribute
|
||
leaf blocks.
|
||
When one is found,
|
||
|
||
a. Walk the attr leaf block to find candidate keys.
|
||
When one is found,
|
||
|
||
1. Check the name for problems, and ignore the name if there are.
|
||
|
||
2. Retrieve the value.
|
||
If that succeeds, add the name and value to the staging xfarray and
|
||
xfblob.
|
||
|
||
2. If the memory usage of the xfarray and xfblob exceed a certain amount of
|
||
memory or there are no more attr fork blocks to examine, unlock the file and
|
||
add the staged extended attributes to the temporary file.
|
||
|
||
3. Use atomic extent swapping to exchange the new and old extended attribute
|
||
structures.
|
||
The old attribute blocks are now attached to the temporary file.
|
||
|
||
4. Reap the temporary file.
|
||
|
||
The proposed patchset is the
|
||
`extended attribute repair
|
||
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-xattrs>`_
|
||
series.
|
||
|
||
Fixing Directories
|
||
------------------
|
||
|
||
Fixing directories is difficult with currently available filesystem features,
|
||
since directory entries are not redundant.
|
||
The offline repair tool scans all inodes to find files with nonzero link count,
|
||
and then it scans all directories to establish parentage of those linked files.
|
||
Damaged files and directories are zapped, and files with no parent are
|
||
moved to the ``/lost+found`` directory.
|
||
It does not try to salvage anything.
|
||
|
||
The best that online repair can do at this time is to read directory data
|
||
blocks and salvage any dirents that look plausible, correct link counts, and
|
||
move orphans back into the directory tree.
|
||
The salvage process is discussed in the case study at the end of this section.
|
||
The :ref:`file link count fsck <nlinks>` code takes care of fixing link counts
|
||
and moving orphans to the ``/lost+found`` directory.
|
||
|
||
Case Study: Salvaging Directories
|
||
`````````````````````````````````
|
||
|
||
Unlike extended attributes, directory blocks are all the same size, so
|
||
salvaging directories is straightforward:
|
||
|
||
1. Find the parent of the directory.
|
||
If the dotdot entry is not unreadable, try to confirm that the alleged
|
||
parent has a child entry pointing back to the directory being repaired.
|
||
Otherwise, walk the filesystem to find it.
|
||
|
||
2. Walk the first partition of data fork of the directory to find the directory
|
||
entry data blocks.
|
||
When one is found,
|
||
|
||
a. Walk the directory data block to find candidate entries.
|
||
When an entry is found:
|
||
|
||
i. Check the name for problems, and ignore the name if there are.
|
||
|
||
ii. Retrieve the inumber and grab the inode.
|
||
If that succeeds, add the name, inode number, and file type to the
|
||
staging xfarray and xblob.
|
||
|
||
3. If the memory usage of the xfarray and xfblob exceed a certain amount of
|
||
memory or there are no more directory data blocks to examine, unlock the
|
||
directory and add the staged dirents into the temporary directory.
|
||
Truncate the staging files.
|
||
|
||
4. Use atomic extent swapping to exchange the new and old directory structures.
|
||
The old directory blocks are now attached to the temporary file.
|
||
|
||
5. Reap the temporary file.
|
||
|
||
**Future Work Question**: Should repair revalidate the dentry cache when
|
||
rebuilding a directory?
|
||
|
||
*Answer*: Yes, it should.
|
||
|
||
In theory it is necessary to scan all dentry cache entries for a directory to
|
||
ensure that one of the following apply:
|
||
|
||
1. The cached dentry reflects an ondisk dirent in the new directory.
|
||
|
||
2. The cached dentry no longer has a corresponding ondisk dirent in the new
|
||
directory and the dentry can be purged from the cache.
|
||
|
||
3. The cached dentry no longer has an ondisk dirent but the dentry cannot be
|
||
purged.
|
||
This is the problem case.
|
||
|
||
Unfortunately, the current dentry cache design doesn't provide a means to walk
|
||
every child dentry of a specific directory, which makes this a hard problem.
|
||
There is no known solution.
|
||
|
||
The proposed patchset is the
|
||
`directory repair
|
||
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-dirs>`_
|
||
series.
|
||
|
||
Parent Pointers
|
||
```````````````
|
||
|
||
A parent pointer is a piece of file metadata that enables a user to locate the
|
||
file's parent directory without having to traverse the directory tree from the
|
||
root.
|
||
Without them, reconstruction of directory trees is hindered in much the same
|
||
way that the historic lack of reverse space mapping information once hindered
|
||
reconstruction of filesystem space metadata.
|
||
The parent pointer feature, however, makes total directory reconstruction
|
||
possible.
|
||
|
||
XFS parent pointers include the dirent name and location of the entry within
|
||
the parent directory.
|
||
In other words, child files use extended attributes to store pointers to
|
||
parents in the form ``(parent_inum, parent_gen, dirent_pos) → (dirent_name)``.
|
||
The directory checking process can be strengthened to ensure that the target of
|
||
each dirent also contains a parent pointer pointing back to the dirent.
|
||
Likewise, each parent pointer can be checked by ensuring that the target of
|
||
each parent pointer is a directory and that it contains a dirent matching
|
||
the parent pointer.
|
||
Both online and offline repair can use this strategy.
|
||
|
||
**Note**: The ondisk format of parent pointers is not yet finalized.
|
||
|
||
+--------------------------------------------------------------------------+
|
||
| **Historical Sidebar**: |
|
||
+--------------------------------------------------------------------------+
|
||
| Directory parent pointers were first proposed as an XFS feature more |
|
||
| than a decade ago by SGI. |
|
||
| Each link from a parent directory to a child file is mirrored with an |
|
||
| extended attribute in the child that could be used to identify the |
|
||
| parent directory. |
|
||
| Unfortunately, this early implementation had major shortcomings and was |
|
||
| never merged into Linux XFS: |
|
||
| |
|
||
| 1. The XFS codebase of the late 2000s did not have the infrastructure to |
|
||
| enforce strong referential integrity in the directory tree. |
|
||
| It did not guarantee that a change in a forward link would always be |
|
||
| followed up with the corresponding change to the reverse links. |
|
||
| |
|
||
| 2. Referential integrity was not integrated into offline repair. |
|
||
| Checking and repairs were performed on mounted filesystems without |
|
||
| taking any kernel or inode locks to coordinate access. |
|
||
| It is not clear how this actually worked properly. |
|
||
| |
|
||
| 3. The extended attribute did not record the name of the directory entry |
|
||
| in the parent, so the SGI parent pointer implementation cannot be |
|
||
| used to reconnect the directory tree. |
|
||
| |
|
||
| 4. Extended attribute forks only support 65,536 extents, which means |
|
||
| that parent pointer attribute creation is likely to fail at some |
|
||
| point before the maximum file link count is achieved. |
|
||
| |
|
||
| The original parent pointer design was too unstable for something like |
|
||
| a file system repair to depend on. |
|
||
| Allison Henderson, Chandan Babu, and Catherine Hoang are working on a |
|
||
| second implementation that solves all shortcomings of the first. |
|
||
| During 2022, Allison introduced log intent items to track physical |
|
||
| manipulations of the extended attribute structures. |
|
||
| This solves the referential integrity problem by making it possible to |
|
||
| commit a dirent update and a parent pointer update in the same |
|
||
| transaction. |
|
||
| Chandan increased the maximum extent counts of both data and attribute |
|
||
| forks, thereby ensuring that the extended attribute structure can grow |
|
||
| to handle the maximum hardlink count of any file. |
|
||
+--------------------------------------------------------------------------+
|
||
|
||
Case Study: Repairing Directories with Parent Pointers
|
||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||
|
||
Directory rebuilding uses a :ref:`coordinated inode scan <iscan>` and
|
||
a :ref:`directory entry live update hook <liveupdate>` as follows:
|
||
|
||
1. Set up a temporary directory for generating the new directory structure,
|
||
an xfblob for storing entry names, and an xfarray for stashing directory
|
||
updates.
|
||
|
||
2. Set up an inode scanner and hook into the directory entry code to receive
|
||
updates on directory operations.
|
||
|
||
3. For each parent pointer found in each file scanned, decide if the parent
|
||
pointer references the directory of interest.
|
||
If so:
|
||
|
||
a. Stash an addname entry for this dirent in the xfarray for later.
|
||
|
||
b. When finished scanning that file, flush the stashed updates to the
|
||
temporary directory.
|
||
|
||
4. For each live directory update received via the hook, decide if the child
|
||
has already been scanned.
|
||
If so:
|
||
|
||
a. Stash an addname or removename entry for this dirent update in the
|
||
xfarray for later.
|
||
We cannot write directly to the temporary directory because hook
|
||
functions are not allowed to modify filesystem metadata.
|
||
Instead, we stash updates in the xfarray and rely on the scanner thread
|
||
to apply the stashed updates to the temporary directory.
|
||
|
||
5. When the scan is complete, atomically swap the contents of the temporary
|
||
directory and the directory being repaired.
|
||
The temporary directory now contains the damaged directory structure.
|
||
|
||
6. Reap the temporary directory.
|
||
|
||
7. Update the dirent position field of parent pointers as necessary.
|
||
This may require the queuing of a substantial number of xattr log intent
|
||
items.
|
||
|
||
The proposed patchset is the
|
||
`parent pointers directory repair
|
||
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=pptrs-online-dir-repair>`_
|
||
series.
|
||
|
||
**Unresolved Question**: How will repair ensure that the ``dirent_pos`` fields
|
||
match in the reconstructed directory?
|
||
|
||
*Answer*: There are a few ways to solve this problem:
|
||
|
||
1. The field could be designated advisory, since the other three values are
|
||
sufficient to find the entry in the parent.
|
||
However, this makes indexed key lookup impossible while repairs are ongoing.
|
||
|
||
2. We could allow creating directory entries at specified offsets, which solves
|
||
the referential integrity problem but runs the risk that dirent creation
|
||
will fail due to conflicts with the free space in the directory.
|
||
|
||
These conflicts could be resolved by appending the directory entry and
|
||
amending the xattr code to support updating an xattr key and reindexing the
|
||
dabtree, though this would have to be performed with the parent directory
|
||
still locked.
|
||
|
||
3. Same as above, but remove the old parent pointer entry and add a new one
|
||
atomically.
|
||
|
||
4. Change the ondisk xattr format to ``(parent_inum, name) → (parent_gen)``,
|
||
which would provide the attr name uniqueness that we require, without
|
||
forcing repair code to update the dirent position.
|
||
Unfortunately, this requires changes to the xattr code to support attr
|
||
names as long as 263 bytes.
|
||
|
||
5. Change the ondisk xattr format to ``(parent_inum, hash(name)) →
|
||
(name, parent_gen)``.
|
||
If the hash is sufficiently resistant to collisions (e.g. sha256) then
|
||
this should provide the attr name uniqueness that we require.
|
||
Names shorter than 247 bytes could be stored directly.
|
||
|
||
Discussion is ongoing under the `parent pointers patch deluge
|
||
<https://www.spinics.net/lists/linux-xfs/msg69397.html>`_.
|
||
|
||
Case Study: Repairing Parent Pointers
|
||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||
|
||
Online reconstruction of a file's parent pointer information works similarly to
|
||
directory reconstruction:
|
||
|
||
1. Set up a temporary file for generating a new extended attribute structure,
|
||
an `xfblob<xfblob>` for storing parent pointer names, and an xfarray for
|
||
stashing parent pointer updates.
|
||
|
||
2. Set up an inode scanner and hook into the directory entry code to receive
|
||
updates on directory operations.
|
||
|
||
3. For each directory entry found in each directory scanned, decide if the
|
||
dirent references the file of interest.
|
||
If so:
|
||
|
||
a. Stash an addpptr entry for this parent pointer in the xfblob and xfarray
|
||
for later.
|
||
|
||
b. When finished scanning the directory, flush the stashed updates to the
|
||
temporary directory.
|
||
|
||
4. For each live directory update received via the hook, decide if the parent
|
||
has already been scanned.
|
||
If so:
|
||
|
||
a. Stash an addpptr or removepptr entry for this dirent update in the
|
||
xfarray for later.
|
||
We cannot write parent pointers directly to the temporary file because
|
||
hook functions are not allowed to modify filesystem metadata.
|
||
Instead, we stash updates in the xfarray and rely on the scanner thread
|
||
to apply the stashed parent pointer updates to the temporary file.
|
||
|
||
5. Copy all non-parent pointer extended attributes to the temporary file.
|
||
|
||
6. When the scan is complete, atomically swap the attribute fork of the
|
||
temporary file and the file being repaired.
|
||
The temporary file now contains the damaged extended attribute structure.
|
||
|
||
7. Reap the temporary file.
|
||
|
||
The proposed patchset is the
|
||
`parent pointers repair
|
||
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=pptrs-online-parent-repair>`_
|
||
series.
|
||
|
||
Digression: Offline Checking of Parent Pointers
|
||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||
|
||
Examining parent pointers in offline repair works differently because corrupt
|
||
files are erased long before directory tree connectivity checks are performed.
|
||
Parent pointer checks are therefore a second pass to be added to the existing
|
||
connectivity checks:
|
||
|
||
1. After the set of surviving files has been established (i.e. phase 6),
|
||
walk the surviving directories of each AG in the filesystem.
|
||
This is already performed as part of the connectivity checks.
|
||
|
||
2. For each directory entry found, record the name in an xfblob, and store
|
||
``(child_ag_inum, parent_inum, parent_gen, dirent_pos)`` tuples in a
|
||
per-AG in-memory slab.
|
||
|
||
3. For each AG in the filesystem,
|
||
|
||
a. Sort the per-AG tuples in order of child_ag_inum, parent_inum, and
|
||
dirent_pos.
|
||
|
||
b. For each inode in the AG,
|
||
|
||
1. Scan the inode for parent pointers.
|
||
Record the names in a per-file xfblob, and store ``(parent_inum,
|
||
parent_gen, dirent_pos)`` tuples in a per-file slab.
|
||
|
||
2. Sort the per-file tuples in order of parent_inum, and dirent_pos.
|
||
|
||
3. Position one slab cursor at the start of the inode's records in the
|
||
per-AG tuple slab.
|
||
This should be trivial since the per-AG tuples are in child inumber
|
||
order.
|
||
|
||
4. Position a second slab cursor at the start of the per-file tuple slab.
|
||
|
||
5. Iterate the two cursors in lockstep, comparing the parent_ino and
|
||
dirent_pos fields of the records under each cursor.
|
||
|
||
a. Tuples in the per-AG list but not the per-file list are missing and
|
||
need to be written to the inode.
|
||
|
||
b. Tuples in the per-file list but not the per-AG list are dangling
|
||
and need to be removed from the inode.
|
||
|
||
c. For tuples in both lists, update the parent_gen and name components
|
||
of the parent pointer if necessary.
|
||
|
||
4. Move on to examining link counts, as we do today.
|
||
|
||
The proposed patchset is the
|
||
`offline parent pointers repair
|
||
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=pptrs-repair>`_
|
||
series.
|
||
|
||
Rebuilding directories from parent pointers in offline repair is very
|
||
challenging because it currently uses a single-pass scan of the filesystem
|
||
during phase 3 to decide which files are corrupt enough to be zapped.
|
||
This scan would have to be converted into a multi-pass scan:
|
||
|
||
1. The first pass of the scan zaps corrupt inodes, forks, and attributes
|
||
much as it does now.
|
||
Corrupt directories are noted but not zapped.
|
||
|
||
2. The next pass records parent pointers pointing to the directories noted
|
||
as being corrupt in the first pass.
|
||
This second pass may have to happen after the phase 4 scan for duplicate
|
||
blocks, if phase 4 is also capable of zapping directories.
|
||
|
||
3. The third pass resets corrupt directories to an empty shortform directory.
|
||
Free space metadata has not been ensured yet, so repair cannot yet use the
|
||
directory building code in libxfs.
|
||
|
||
4. At the start of phase 6, space metadata have been rebuilt.
|
||
Use the parent pointer information recorded during step 2 to reconstruct
|
||
the dirents and add them to the now-empty directories.
|
||
|
||
This code has not yet been constructed.
|
||
|
||
.. _orphanage:
|
||
|
||
The Orphanage
|
||
-------------
|
||
|
||
Filesystems present files as a directed, and hopefully acyclic, graph.
|
||
In other words, a tree.
|
||
The root of the filesystem is a directory, and each entry in a directory points
|
||
downwards either to more subdirectories or to non-directory files.
|
||
Unfortunately, a disruption in the directory graph pointers result in a
|
||
disconnected graph, which makes files impossible to access via regular path
|
||
resolution.
|
||
|
||
Without parent pointers, the directory parent pointer online scrub code can
|
||
detect a dotdot entry pointing to a parent directory that doesn't have a link
|
||
back to the child directory and the file link count checker can detect a file
|
||
that isn't pointed to by any directory in the filesystem.
|
||
If such a file has a positive link count, the file is an orphan.
|
||
|
||
With parent pointers, directories can be rebuilt by scanning parent pointers
|
||
and parent pointers can be rebuilt by scanning directories.
|
||
This should reduce the incidence of files ending up in ``/lost+found``.
|
||
|
||
When orphans are found, they should be reconnected to the directory tree.
|
||
Offline fsck solves the problem by creating a directory ``/lost+found`` to
|
||
serve as an orphanage, and linking orphan files into the orphanage by using the
|
||
inumber as the name.
|
||
Reparenting a file to the orphanage does not reset any of its permissions or
|
||
ACLs.
|
||
|
||
This process is more involved in the kernel than it is in userspace.
|
||
The directory and file link count repair setup functions must use the regular
|
||
VFS mechanisms to create the orphanage directory with all the necessary
|
||
security attributes and dentry cache entries, just like a regular directory
|
||
tree modification.
|
||
|
||
Orphaned files are adopted by the orphanage as follows:
|
||
|
||
1. Call ``xrep_orphanage_try_create`` at the start of the scrub setup function
|
||
to try to ensure that the lost and found directory actually exists.
|
||
This also attaches the orphanage directory to the scrub context.
|
||
|
||
2. If the decision is made to reconnect a file, take the IOLOCK of both the
|
||
orphanage and the file being reattached.
|
||
The ``xrep_orphanage_iolock_two`` function follows the inode locking
|
||
strategy discussed earlier.
|
||
|
||
3. Call ``xrep_orphanage_compute_blkres`` and ``xrep_orphanage_compute_name``
|
||
to compute the new name in the orphanage and the block reservation required.
|
||
|
||
4. Use ``xrep_orphanage_adoption_prep`` to reserve resources to the repair
|
||
transaction.
|
||
|
||
5. Call ``xrep_orphanage_adopt`` to reparent the orphaned file into the lost
|
||
and found, and update the kernel dentry cache.
|
||
|
||
The proposed patches are in the
|
||
`orphanage adoption
|
||
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-orphanage>`_
|
||
series.
|
||
|
||
6. Userspace Algorithms and Data Structures
|
||
===========================================
|
||
|
||
This section discusses the key algorithms and data structures of the userspace
|
||
program, ``xfs_scrub``, that provide the ability to drive metadata checks and
|
||
repairs in the kernel, verify file data, and look for other potential problems.
|
||
|
||
.. _scrubcheck:
|
||
|
||
Checking Metadata
|
||
-----------------
|
||
|
||
Recall the :ref:`phases of fsck work<scrubphases>` outlined earlier.
|
||
That structure follows naturally from the data dependencies designed into the
|
||
filesystem from its beginnings in 1993.
|
||
In XFS, there are several groups of metadata dependencies:
|
||
|
||
a. Filesystem summary counts depend on consistency within the inode indices,
|
||
the allocation group space btrees, and the realtime volume space
|
||
information.
|
||
|
||
b. Quota resource counts depend on consistency within the quota file data
|
||
forks, inode indices, inode records, and the forks of every file on the
|
||
system.
|
||
|
||
c. The naming hierarchy depends on consistency within the directory and
|
||
extended attribute structures.
|
||
This includes file link counts.
|
||
|
||
d. Directories, extended attributes, and file data depend on consistency within
|
||
the file forks that map directory and extended attribute data to physical
|
||
storage media.
|
||
|
||
e. The file forks depends on consistency within inode records and the space
|
||
metadata indices of the allocation groups and the realtime volume.
|
||
This includes quota and realtime metadata files.
|
||
|
||
f. Inode records depends on consistency within the inode metadata indices.
|
||
|
||
g. Realtime space metadata depend on the inode records and data forks of the
|
||
realtime metadata inodes.
|
||
|
||
h. The allocation group metadata indices (free space, inodes, reference count,
|
||
and reverse mapping btrees) depend on consistency within the AG headers and
|
||
between all the AG metadata btrees.
|
||
|
||
i. ``xfs_scrub`` depends on the filesystem being mounted and kernel support
|
||
for online fsck functionality.
|
||
|
||
Therefore, a metadata dependency graph is a convenient way to schedule checking
|
||
operations in the ``xfs_scrub`` program:
|
||
|
||
- Phase 1 checks that the provided path maps to an XFS filesystem and detect
|
||
the kernel's scrubbing abilities, which validates group (i).
|
||
|
||
- Phase 2 scrubs groups (g) and (h) in parallel using a threaded workqueue.
|
||
|
||
- Phase 3 scans inodes in parallel.
|
||
For each inode, groups (f), (e), and (d) are checked, in that order.
|
||
|
||
- Phase 4 repairs everything in groups (i) through (d) so that phases 5 and 6
|
||
may run reliably.
|
||
|
||
- Phase 5 starts by checking groups (b) and (c) in parallel before moving on
|
||
to checking names.
|
||
|
||
- Phase 6 depends on groups (i) through (b) to find file data blocks to verify,
|
||
to read them, and to report which blocks of which files are affected.
|
||
|
||
- Phase 7 checks group (a), having validated everything else.
|
||
|
||
Notice that the data dependencies between groups are enforced by the structure
|
||
of the program flow.
|
||
|
||
Parallel Inode Scans
|
||
--------------------
|
||
|
||
An XFS filesystem can easily contain hundreds of millions of inodes.
|
||
Given that XFS targets installations with large high-performance storage,
|
||
it is desirable to scrub inodes in parallel to minimize runtime, particularly
|
||
if the program has been invoked manually from a command line.
|
||
This requires careful scheduling to keep the threads as evenly loaded as
|
||
possible.
|
||
|
||
Early iterations of the ``xfs_scrub`` inode scanner naïvely created a single
|
||
workqueue and scheduled a single workqueue item per AG.
|
||
Each workqueue item walked the inode btree (with ``XFS_IOC_INUMBERS``) to find
|
||
inode chunks and then called bulkstat (``XFS_IOC_BULKSTAT``) to gather enough
|
||
information to construct file handles.
|
||
The file handle was then passed to a function to generate scrub items for each
|
||
metadata object of each inode.
|
||
This simple algorithm leads to thread balancing problems in phase 3 if the
|
||
filesystem contains one AG with a few large sparse files and the rest of the
|
||
AGs contain many smaller files.
|
||
The inode scan dispatch function was not sufficiently granular; it should have
|
||
been dispatching at the level of individual inodes, or, to constrain memory
|
||
consumption, inode btree records.
|
||
|
||
Thanks to Dave Chinner, bounded workqueues in userspace enable ``xfs_scrub`` to
|
||
avoid this problem with ease by adding a second workqueue.
|
||
Just like before, the first workqueue is seeded with one workqueue item per AG,
|
||
and it uses INUMBERS to find inode btree chunks.
|
||
The second workqueue, however, is configured with an upper bound on the number
|
||
of items that can be waiting to be run.
|
||
Each inode btree chunk found by the first workqueue's workers are queued to the
|
||
second workqueue, and it is this second workqueue that queries BULKSTAT,
|
||
creates a file handle, and passes it to a function to generate scrub items for
|
||
each metadata object of each inode.
|
||
If the second workqueue is too full, the workqueue add function blocks the
|
||
first workqueue's workers until the backlog eases.
|
||
This doesn't completely solve the balancing problem, but reduces it enough to
|
||
move on to more pressing issues.
|
||
|
||
The proposed patchsets are the scrub
|
||
`performance tweaks
|
||
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-performance-tweaks>`_
|
||
and the
|
||
`inode scan rebalance
|
||
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-iscan-rebalance>`_
|
||
series.
|
||
|
||
.. _scrubrepair:
|
||
|
||
Scheduling Repairs
|
||
------------------
|
||
|
||
During phase 2, corruptions and inconsistencies reported in any AGI header or
|
||
inode btree are repaired immediately, because phase 3 relies on proper
|
||
functioning of the inode indices to find inodes to scan.
|
||
Failed repairs are rescheduled to phase 4.
|
||
Problems reported in any other space metadata are deferred to phase 4.
|
||
Optimization opportunities are always deferred to phase 4, no matter their
|
||
origin.
|
||
|
||
During phase 3, corruptions and inconsistencies reported in any part of a
|
||
file's metadata are repaired immediately if all space metadata were validated
|
||
during phase 2.
|
||
Repairs that fail or cannot be repaired immediately are scheduled for phase 4.
|
||
|
||
In the original design of ``xfs_scrub``, it was thought that repairs would be
|
||
so infrequent that the ``struct xfs_scrub_metadata`` objects used to
|
||
communicate with the kernel could also be used as the primary object to
|
||
schedule repairs.
|
||
With recent increases in the number of optimizations possible for a given
|
||
filesystem object, it became much more memory-efficient to track all eligible
|
||
repairs for a given filesystem object with a single repair item.
|
||
Each repair item represents a single lockable object -- AGs, metadata files,
|
||
individual inodes, or a class of summary information.
|
||
|
||
Phase 4 is responsible for scheduling a lot of repair work in as quick a
|
||
manner as is practical.
|
||
The :ref:`data dependencies <scrubcheck>` outlined earlier still apply, which
|
||
means that ``xfs_scrub`` must try to complete the repair work scheduled by
|
||
phase 2 before trying repair work scheduled by phase 3.
|
||
The repair process is as follows:
|
||
|
||
1. Start a round of repair with a workqueue and enough workers to keep the CPUs
|
||
as busy as the user desires.
|
||
|
||
a. For each repair item queued by phase 2,
|
||
|
||
i. Ask the kernel to repair everything listed in the repair item for a
|
||
given filesystem object.
|
||
|
||
ii. Make a note if the kernel made any progress in reducing the number
|
||
of repairs needed for this object.
|
||
|
||
iii. If the object no longer requires repairs, revalidate all metadata
|
||
associated with this object.
|
||
If the revalidation succeeds, drop the repair item.
|
||
If not, requeue the item for more repairs.
|
||
|
||
b. If any repairs were made, jump back to 1a to retry all the phase 2 items.
|
||
|
||
c. For each repair item queued by phase 3,
|
||
|
||
i. Ask the kernel to repair everything listed in the repair item for a
|
||
given filesystem object.
|
||
|
||
ii. Make a note if the kernel made any progress in reducing the number
|
||
of repairs needed for this object.
|
||
|
||
iii. If the object no longer requires repairs, revalidate all metadata
|
||
associated with this object.
|
||
If the revalidation succeeds, drop the repair item.
|
||
If not, requeue the item for more repairs.
|
||
|
||
d. If any repairs were made, jump back to 1c to retry all the phase 3 items.
|
||
|
||
2. If step 1 made any repair progress of any kind, jump back to step 1 to start
|
||
another round of repair.
|
||
|
||
3. If there are items left to repair, run them all serially one more time.
|
||
Complain if the repairs were not successful, since this is the last chance
|
||
to repair anything.
|
||
|
||
Corruptions and inconsistencies encountered during phases 5 and 7 are repaired
|
||
immediately.
|
||
Corrupt file data blocks reported by phase 6 cannot be recovered by the
|
||
filesystem.
|
||
|
||
The proposed patchsets are the
|
||
`repair warning improvements
|
||
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-better-repair-warnings>`_,
|
||
refactoring of the
|
||
`repair data dependency
|
||
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-repair-data-deps>`_
|
||
and
|
||
`object tracking
|
||
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-object-tracking>`_,
|
||
and the
|
||
`repair scheduling
|
||
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-repair-scheduling>`_
|
||
improvement series.
|
||
|
||
Checking Names for Confusable Unicode Sequences
|
||
-----------------------------------------------
|
||
|
||
If ``xfs_scrub`` succeeds in validating the filesystem metadata by the end of
|
||
phase 4, it moves on to phase 5, which checks for suspicious looking names in
|
||
the filesystem.
|
||
These names consist of the filesystem label, names in directory entries, and
|
||
the names of extended attributes.
|
||
Like most Unix filesystems, XFS imposes the sparest of constraints on the
|
||
contents of a name:
|
||
|
||
- Slashes and null bytes are not allowed in directory entries.
|
||
|
||
- Null bytes are not allowed in userspace-visible extended attributes.
|
||
|
||
- Null bytes are not allowed in the filesystem label.
|
||
|
||
Directory entries and attribute keys store the length of the name explicitly
|
||
ondisk, which means that nulls are not name terminators.
|
||
For this section, the term "naming domain" refers to any place where names are
|
||
presented together -- all the names in a directory, or all the attributes of a
|
||
file.
|
||
|
||
Although the Unix naming constraints are very permissive, the reality of most
|
||
modern-day Linux systems is that programs work with Unicode character code
|
||
points to support international languages.
|
||
These programs typically encode those code points in UTF-8 when interfacing
|
||
with the C library because the kernel expects null-terminated names.
|
||
In the common case, therefore, names found in an XFS filesystem are actually
|
||
UTF-8 encoded Unicode data.
|
||
|
||
To maximize its expressiveness, the Unicode standard defines separate control
|
||
points for various characters that render similarly or identically in writing
|
||
systems around the world.
|
||
For example, the character "Cyrillic Small Letter A" U+0430 "а" often renders
|
||
identically to "Latin Small Letter A" U+0061 "a".
|
||
|
||
The standard also permits characters to be constructed in multiple ways --
|
||
either by using a defined code point, or by combining one code point with
|
||
various combining marks.
|
||
For example, the character "Angstrom Sign U+212B "Å" can also be expressed
|
||
as "Latin Capital Letter A" U+0041 "A" followed by "Combining Ring Above"
|
||
U+030A "◌̊".
|
||
Both sequences render identically.
|
||
|
||
Like the standards that preceded it, Unicode also defines various control
|
||
characters to alter the presentation of text.
|
||
For example, the character "Right-to-Left Override" U+202E can trick some
|
||
programs into rendering "moo\\xe2\\x80\\xaegnp.txt" as "mootxt.png".
|
||
A second category of rendering problems involves whitespace characters.
|
||
If the character "Zero Width Space" U+200B is encountered in a file name, the
|
||
name will render identically to a name that does not have the zero width
|
||
space.
|
||
|
||
If two names within a naming domain have different byte sequences but render
|
||
identically, a user may be confused by it.
|
||
The kernel, in its indifference to upper level encoding schemes, permits this.
|
||
Most filesystem drivers persist the byte sequence names that are given to them
|
||
by the VFS.
|
||
|
||
Techniques for detecting confusable names are explained in great detail in
|
||
sections 4 and 5 of the
|
||
`Unicode Security Mechanisms <https://unicode.org/reports/tr39/>`_
|
||
document.
|
||
When ``xfs_scrub`` detects UTF-8 encoding in use on a system, it uses the
|
||
Unicode normalization form NFD in conjunction with the confusable name
|
||
detection component of
|
||
`libicu <https://github.com/unicode-org/icu>`_
|
||
to identify names with a directory or within a file's extended attributes that
|
||
could be confused for each other.
|
||
Names are also checked for control characters, non-rendering characters, and
|
||
mixing of bidirectional characters.
|
||
All of these potential issues are reported to the system administrator during
|
||
phase 5.
|
||
|
||
Media Verification of File Data Extents
|
||
---------------------------------------
|
||
|
||
The system administrator can elect to initiate a media scan of all file data
|
||
blocks.
|
||
This scan after validation of all filesystem metadata (except for the summary
|
||
counters) as phase 6.
|
||
The scan starts by calling ``FS_IOC_GETFSMAP`` to scan the filesystem space map
|
||
to find areas that are allocated to file data fork extents.
|
||
Gaps between data fork extents that are smaller than 64k are treated as if
|
||
they were data fork extents to reduce the command setup overhead.
|
||
When the space map scan accumulates a region larger than 32MB, a media
|
||
verification request is sent to the disk as a directio read of the raw block
|
||
device.
|
||
|
||
If the verification read fails, ``xfs_scrub`` retries with single-block reads
|
||
to narrow down the failure to the specific region of the media and recorded.
|
||
When it has finished issuing verification requests, it again uses the space
|
||
mapping ioctl to map the recorded media errors back to metadata structures
|
||
and report what has been lost.
|
||
For media errors in blocks owned by files, parent pointers can be used to
|
||
construct file paths from inode numbers for user-friendly reporting.
|
||
|
||
7. Conclusion and Future Work
|
||
=============================
|
||
|
||
It is hoped that the reader of this document has followed the designs laid out
|
||
in this document and now has some familiarity with how XFS performs online
|
||
rebuilding of its metadata indices, and how filesystem users can interact with
|
||
that functionality.
|
||
Although the scope of this work is daunting, it is hoped that this guide will
|
||
make it easier for code readers to understand what has been built, for whom it
|
||
has been built, and why.
|
||
Please feel free to contact the XFS mailing list with questions.
|
||
|
||
FIEXCHANGE_RANGE
|
||
----------------
|
||
|
||
As discussed earlier, a second frontend to the atomic extent swap mechanism is
|
||
a new ioctl call that userspace programs can use to commit updates to files
|
||
atomically.
|
||
This frontend has been out for review for several years now, though the
|
||
necessary refinements to online repair and lack of customer demand mean that
|
||
the proposal has not been pushed very hard.
|
||
|
||
Extent Swapping with Regular User Files
|
||
```````````````````````````````````````
|
||
|
||
As mentioned earlier, XFS has long had the ability to swap extents between
|
||
files, which is used almost exclusively by ``xfs_fsr`` to defragment files.
|
||
The earliest form of this was the fork swap mechanism, where the entire
|
||
contents of data forks could be exchanged between two files by exchanging the
|
||
raw bytes in each inode fork's immediate area.
|
||
When XFS v5 came along with self-describing metadata, this old mechanism grew
|
||
some log support to continue rewriting the owner fields of BMBT blocks during
|
||
log recovery.
|
||
When the reverse mapping btree was later added to XFS, the only way to maintain
|
||
the consistency of the fork mappings with the reverse mapping index was to
|
||
develop an iterative mechanism that used deferred bmap and rmap operations to
|
||
swap mappings one at a time.
|
||
This mechanism is identical to steps 2-3 from the procedure above except for
|
||
the new tracking items, because the atomic extent swap mechanism is an
|
||
iteration of an existing mechanism and not something totally novel.
|
||
For the narrow case of file defragmentation, the file contents must be
|
||
identical, so the recovery guarantees are not much of a gain.
|
||
|
||
Atomic extent swapping is much more flexible than the existing swapext
|
||
implementations because it can guarantee that the caller never sees a mix of
|
||
old and new contents even after a crash, and it can operate on two arbitrary
|
||
file fork ranges.
|
||
The extra flexibility enables several new use cases:
|
||
|
||
- **Atomic commit of file writes**: A userspace process opens a file that it
|
||
wants to update.
|
||
Next, it opens a temporary file and calls the file clone operation to reflink
|
||
the first file's contents into the temporary file.
|
||
Writes to the original file should instead be written to the temporary file.
|
||
Finally, the process calls the atomic extent swap system call
|
||
(``FIEXCHANGE_RANGE``) to exchange the file contents, thereby committing all
|
||
of the updates to the original file, or none of them.
|
||
|
||
.. _swapext_if_unchanged:
|
||
|
||
- **Transactional file updates**: The same mechanism as above, but the caller
|
||
only wants the commit to occur if the original file's contents have not
|
||
changed.
|
||
To make this happen, the calling process snapshots the file modification and
|
||
change timestamps of the original file before reflinking its data to the
|
||
temporary file.
|
||
When the program is ready to commit the changes, it passes the timestamps
|
||
into the kernel as arguments to the atomic extent swap system call.
|
||
The kernel only commits the changes if the provided timestamps match the
|
||
original file.
|
||
|
||
- **Emulation of atomic block device writes**: Export a block device with a
|
||
logical sector size matching the filesystem block size to force all writes
|
||
to be aligned to the filesystem block size.
|
||
Stage all writes to a temporary file, and when that is complete, call the
|
||
atomic extent swap system call with a flag to indicate that holes in the
|
||
temporary file should be ignored.
|
||
This emulates an atomic device write in software, and can support arbitrary
|
||
scattered writes.
|
||
|
||
Vectorized Scrub
|
||
----------------
|
||
|
||
As it turns out, the :ref:`refactoring <scrubrepair>` of repair items mentioned
|
||
earlier was a catalyst for enabling a vectorized scrub system call.
|
||
Since 2018, the cost of making a kernel call has increased considerably on some
|
||
systems to mitigate the effects of speculative execution attacks.
|
||
This incentivizes program authors to make as few system calls as possible to
|
||
reduce the number of times an execution path crosses a security boundary.
|
||
|
||
With vectorized scrub, userspace pushes to the kernel the identity of a
|
||
filesystem object, a list of scrub types to run against that object, and a
|
||
simple representation of the data dependencies between the selected scrub
|
||
types.
|
||
The kernel executes as much of the caller's plan as it can until it hits a
|
||
dependency that cannot be satisfied due to a corruption, and tells userspace
|
||
how much was accomplished.
|
||
It is hoped that ``io_uring`` will pick up enough of this functionality that
|
||
online fsck can use that instead of adding a separate vectored scrub system
|
||
call to XFS.
|
||
|
||
The relevant patchsets are the
|
||
`kernel vectorized scrub
|
||
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=vectorized-scrub>`_
|
||
and
|
||
`userspace vectorized scrub
|
||
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=vectorized-scrub>`_
|
||
series.
|
||
|
||
Quality of Service Targets for Scrub
|
||
------------------------------------
|
||
|
||
One serious shortcoming of the online fsck code is that the amount of time that
|
||
it can spend in the kernel holding resource locks is basically unbounded.
|
||
Userspace is allowed to send a fatal signal to the process which will cause
|
||
``xfs_scrub`` to exit when it reaches a good stopping point, but there's no way
|
||
for userspace to provide a time budget to the kernel.
|
||
Given that the scrub codebase has helpers to detect fatal signals, it shouldn't
|
||
be too much work to allow userspace to specify a timeout for a scrub/repair
|
||
operation and abort the operation if it exceeds budget.
|
||
However, most repair functions have the property that once they begin to touch
|
||
ondisk metadata, the operation cannot be cancelled cleanly, after which a QoS
|
||
timeout is no longer useful.
|
||
|
||
Defragmenting Free Space
|
||
------------------------
|
||
|
||
Over the years, many XFS users have requested the creation of a program to
|
||
clear a portion of the physical storage underlying a filesystem so that it
|
||
becomes a contiguous chunk of free space.
|
||
Call this free space defragmenter ``clearspace`` for short.
|
||
|
||
The first piece the ``clearspace`` program needs is the ability to read the
|
||
reverse mapping index from userspace.
|
||
This already exists in the form of the ``FS_IOC_GETFSMAP`` ioctl.
|
||
The second piece it needs is a new fallocate mode
|
||
(``FALLOC_FL_MAP_FREE_SPACE``) that allocates the free space in a region and
|
||
maps it to a file.
|
||
Call this file the "space collector" file.
|
||
The third piece is the ability to force an online repair.
|
||
|
||
To clear all the metadata out of a portion of physical storage, clearspace
|
||
uses the new fallocate map-freespace call to map any free space in that region
|
||
to the space collector file.
|
||
Next, clearspace finds all metadata blocks in that region by way of
|
||
``GETFSMAP`` and issues forced repair requests on the data structure.
|
||
This often results in the metadata being rebuilt somewhere that is not being
|
||
cleared.
|
||
After each relocation, clearspace calls the "map free space" function again to
|
||
collect any newly freed space in the region being cleared.
|
||
|
||
To clear all the file data out of a portion of the physical storage, clearspace
|
||
uses the FSMAP information to find relevant file data blocks.
|
||
Having identified a good target, it uses the ``FICLONERANGE`` call on that part
|
||
of the file to try to share the physical space with a dummy file.
|
||
Cloning the extent means that the original owners cannot overwrite the
|
||
contents; any changes will be written somewhere else via copy-on-write.
|
||
Clearspace makes its own copy of the frozen extent in an area that is not being
|
||
cleared, and uses ``FIEDEUPRANGE`` (or the :ref:`atomic extent swap
|
||
<swapext_if_unchanged>` feature) to change the target file's data extent
|
||
mapping away from the area being cleared.
|
||
When all other mappings have been moved, clearspace reflinks the space into the
|
||
space collector file so that it becomes unavailable.
|
||
|
||
There are further optimizations that could apply to the above algorithm.
|
||
To clear a piece of physical storage that has a high sharing factor, it is
|
||
strongly desirable to retain this sharing factor.
|
||
In fact, these extents should be moved first to maximize sharing factor after
|
||
the operation completes.
|
||
To make this work smoothly, clearspace needs a new ioctl
|
||
(``FS_IOC_GETREFCOUNTS``) to report reference count information to userspace.
|
||
With the refcount information exposed, clearspace can quickly find the longest,
|
||
most shared data extents in the filesystem, and target them first.
|
||
|
||
**Future Work Question**: How might the filesystem move inode chunks?
|
||
|
||
*Answer*: To move inode chunks, Dave Chinner constructed a prototype program
|
||
that creates a new file with the old contents and then locklessly runs around
|
||
the filesystem updating directory entries.
|
||
The operation cannot complete if the filesystem goes down.
|
||
That problem isn't totally insurmountable: create an inode remapping table
|
||
hidden behind a jump label, and a log item that tracks the kernel walking the
|
||
filesystem to update directory entries.
|
||
The trouble is, the kernel can't do anything about open files, since it cannot
|
||
revoke them.
|
||
|
||
**Future Work Question**: Can static keys be used to minimize the cost of
|
||
supporting ``revoke()`` on XFS files?
|
||
|
||
*Answer*: Yes.
|
||
Until the first revocation, the bailout code need not be in the call path at
|
||
all.
|
||
|
||
The relevant patchsets are the
|
||
`kernel freespace defrag
|
||
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=defrag-freespace>`_
|
||
and
|
||
`userspace freespace defrag
|
||
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=defrag-freespace>`_
|
||
series.
|
||
|
||
Shrinking Filesystems
|
||
---------------------
|
||
|
||
Removing the end of the filesystem ought to be a simple matter of evacuating
|
||
the data and metadata at the end of the filesystem, and handing the freed space
|
||
to the shrink code.
|
||
That requires an evacuation of the space at end of the filesystem, which is a
|
||
use of free space defragmentation!
|