path: root/Documentation/device-mapper/cache.txt
diff options
Diffstat (limited to 'Documentation/device-mapper/cache.txt')
1 files changed, 243 insertions, 0 deletions
diff --git a/Documentation/device-mapper/cache.txt b/Documentation/device-mapper/cache.txt
new file mode 100644
index 00000000..f50470ab
--- /dev/null
+++ b/Documentation/device-mapper/cache.txt
@@ -0,0 +1,243 @@
+dm-cache is a device mapper target written by Joe Thornber, Heinz
+Mauelshagen, and Mike Snitzer.
+It aims to improve performance of a block device (eg, a spindle) by
+dynamically migrating some of its data to a faster, smaller device
+(eg, an SSD).
+This device-mapper solution allows us to insert this caching at
+different levels of the dm stack, for instance above the data device for
+a thin-provisioning pool. Caching solutions that are integrated more
+closely with the virtual memory system should give better performance.
+The target reuses the metadata library used in the thin-provisioning
+The decision as to what data to migrate and when is left to a plug-in
+policy module. Several of these have been written as we experiment,
+and we hope other people will contribute others for specific io
+scenarios (eg. a vm image server).
+ Migration - Movement of the primary copy of a logical block from one
+ device to the other.
+ Promotion - Migration from slow device to fast device.
+ Demotion - Migration from fast device to slow device.
+The origin device always contains a copy of the logical block, which
+may be out of date or kept in sync with the copy on the cache device
+(depending on policy).
+The target is constructed by passing three devices to it (along with
+other parameters detailed later):
+1. An origin device - the big, slow one.
+2. A cache device - the small, fast one.
+3. A small metadata device - records which blocks are in the cache,
+ which are dirty, and extra hints for use by the policy object.
+ This information could be put on the cache device, but having it
+ separate allows the volume manager to configure it differently,
+ e.g. as a mirror for extra robustness.
+Fixed block size
+The origin is divided up into blocks of a fixed size. This block size
+is configurable when you first create the cache. Typically we've been
+using block sizes of 256k - 1024k.
+Having a fixed block size simplifies the target a lot. But it is
+something of a compromise. For instance, a small part of a block may be
+getting hit a lot, yet the whole block will be promoted to the cache.
+So large block sizes are bad because they waste cache space. And small
+block sizes are bad because they increase the amount of metadata (both
+in core and on disk).
+The cache has two modes, writeback and writethrough.
+If writeback, the default, is selected then a write to a block that is
+cached will go only to the cache and the block will be marked dirty in
+the metadata.
+If writethrough is selected then a write to a cached block will not
+complete until it has hit both the origin and cache devices. Clean
+blocks should remain clean.
+A simple cleaner policy is provided, which will clean (write back) all
+dirty blocks in a cache. Useful for decommissioning a cache.
+Migration throttling
+Migrating data between the origin and cache device uses bandwidth.
+The user can set a throttle to prevent more than a certain amount of
+migration occuring at any one time. Currently we're not taking any
+account of normal io traffic going to the devices. More work needs
+doing here to avoid migrating during those peak io moments.
+For the time being, a message "migration_threshold <#sectors>"
+can be used to set the maximum number of sectors being migrated,
+the default being 204800 sectors (or 100MB).
+Updating on-disk metadata
+On-disk metadata is committed every time a REQ_SYNC or REQ_FUA bio is
+written. If no such requests are made then commits will occur every
+second. This means the cache behaves like a physical disk that has a
+write cache (the same is true of the thin-provisioning target). If
+power is lost you may lose some recent writes. The metadata should
+always be consistent in spite of any crash.
+The 'dirty' state for a cache block changes far too frequently for us
+to keep updating it on the fly. So we treat it as a hint. In normal
+operation it will be written when the dm device is suspended. If the
+system crashes all cache blocks will be assumed dirty when restarted.
+Per-block policy hints
+Policy plug-ins can store a chunk of data per cache block. It's up to
+the policy how big this chunk is, but it should be kept small. Like the
+dirty flags this data is lost if there's a crash so a safe fallback
+value should always be possible.
+For instance, the 'mq' policy, which is currently the default policy,
+uses this facility to store the hit count of the cache blocks. If
+there's a crash this information will be lost, which means the cache
+may be less efficient until those hit counts are regenerated.
+Policy hints affect performance, not correctness.
+Policy messaging
+Policies will have different tunables, specific to each one, so we
+need a generic way of getting and setting these. Device-mapper
+messages are used. Refer to cache-policies.txt.
+Discard bitset resolution
+We can avoid copying data during migration if we know the block has
+been discarded. A prime example of this is when mkfs discards the
+whole block device. We store a bitset tracking the discard state of
+blocks. However, we allow this bitset to have a different block size
+from the cache blocks. This is because we need to track the discard
+state for all of the origin device (compare with the dirty bitset
+which is just for the smaller cache device).
+Target interface
+ cache <metadata dev> <cache dev> <origin dev> <block size>
+ <#feature args> [<feature arg>]*
+ <policy> <#policy args> [policy args]*
+ metadata dev : fast device holding the persistent metadata
+ cache dev : fast device holding cached data blocks
+ origin dev : slow device holding original data blocks
+ block size : cache unit size in sectors
+ #feature args : number of feature arguments passed
+ feature args : writethrough. (The default is writeback.)
+ policy : the replacement policy to use
+ #policy args : an even number of arguments corresponding to
+ key/value pairs passed to the policy
+ policy args : key/value pairs passed to the policy
+ E.g. 'sequential_threshold 1024'
+ See cache-policies.txt for details.
+Optional feature arguments are:
+ writethrough : write through caching that prohibits cache block
+ content from being different from origin block content.
+ Without this argument, the default behaviour is to write
+ back cache block contents later for performance reasons,
+ so they may differ from the corresponding origin blocks.
+A policy called 'default' is always registered. This is an alias for
+the policy we currently think is giving best all round performance.
+As the default policy could vary between kernels, if you are relying on
+the characteristics of a specific policy, always request it by name.
+<#used metadata blocks>/<#total metadata blocks> <#read hits> <#read misses>
+<#write hits> <#write misses> <#demotions> <#promotions> <#blocks in cache>
+<#dirty> <#features> <features>* <#core args> <core args>* <#policy args>
+<policy args>*
+#used metadata blocks : Number of metadata blocks used
+#total metadata blocks : Total number of metadata blocks
+#read hits : Number of times a READ bio has been mapped
+ to the cache
+#read misses : Number of times a READ bio has been mapped
+ to the origin
+#write hits : Number of times a WRITE bio has been mapped
+ to the cache
+#write misses : Number of times a WRITE bio has been
+ mapped to the origin
+#demotions : Number of times a block has been removed
+ from the cache
+#promotions : Number of times a block has been moved to
+ the cache
+#blocks in cache : Number of blocks resident in the cache
+#dirty : Number of blocks in the cache that differ
+ from the origin
+#feature args : Number of feature args to follow
+feature args : 'writethrough' (optional)
+#core args : Number of core arguments (must be even)
+core args : Key/value pairs for tuning the core
+ e.g. migration_threshold
+#policy args : Number of policy arguments to follow (must be even)
+policy args : Key/value pairs
+ e.g. 'sequential_threshold 1024
+Policies will have different tunables, specific to each one, so we
+need a generic way of getting and setting these. Device-mapper
+messages are used. (A sysfs interface would also be possible.)
+The message format is:
+ <key> <value>
+ dmsetup message my_cache 0 sequential_threshold 1024
+The test suite can be found here:
+dmsetup create my_cache --table '0 41943040 cache /dev/mapper/metadata \
+ /dev/mapper/ssd /dev/mapper/origin 512 1 writeback default 0'
+dmsetup create my_cache --table '0 41943040 cache /dev/mapper/metadata \
+ /dev/mapper/ssd /dev/mapper/origin 1024 1 writeback \
+ mq 4 sequential_threshold 1024 random_threshold 8'