diff options
Diffstat (limited to 'Documentation/RCU/Design/Requirements')
4 files changed, 2809 insertions, 3379 deletions
diff --git a/Documentation/RCU/Design/Requirements/GPpartitionReaders1.svg b/Documentation/RCU/Design/Requirements/GPpartitionReaders1.svg index 4b4014fda770..87851a8fac1e 100644 --- a/Documentation/RCU/Design/Requirements/GPpartitionReaders1.svg +++ b/Documentation/RCU/Design/Requirements/GPpartitionReaders1.svg @@ -88,7 +88,7 @@ <flowRoot xml:space="preserve" id="flowRoot2985" - style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol"><flowRegion + style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace"><flowRegion id="flowRegion2987"><rect id="rect2989" width="82.85714" @@ -103,7 +103,7 @@ id="text2993" y="-261.66608" x="412.12299" - style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol" + style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace" xml:space="preserve" transform="matrix(0,1,-1,0,0,0)"><tspan y="-261.66608" @@ -135,7 +135,7 @@ </g> <text xml:space="preserve" - style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol" + style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace" x="112.04738" y="268.18076" id="text4429" @@ -146,7 +146,7 @@ y="268.18076">WRITE_ONCE(a, 1);</tspan></text> <text xml:space="preserve" - style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol" + style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace" x="112.04738" y="439.13766" id="text4441" @@ -157,7 +157,7 @@ y="439.13766">WRITE_ONCE(b, 1);</tspan></text> <text xml:space="preserve" - style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol" + style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace" x="255.60869" y="309.29346" id="text4445" @@ -168,7 +168,7 @@ y="309.29346">r1 = READ_ONCE(a);</tspan></text> <text xml:space="preserve" - style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol" + style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace" x="255.14423" y="520.61786" id="text4449" @@ -179,7 +179,7 @@ y="520.61786">WRITE_ONCE(c, 1);</tspan></text> <text xml:space="preserve" - style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol" + style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace" x="396.10254" y="384.71124" id="text4453" @@ -190,7 +190,7 @@ y="384.71124">r2 = READ_ONCE(b);</tspan></text> <text xml:space="preserve" - style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol" + style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace" x="396.10254" y="582.13617" id="text4457" @@ -201,7 +201,7 @@ y="582.13617">r3 = READ_ONCE(c);</tspan></text> <text xml:space="preserve" - style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol" + style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace" x="112.08231" y="213.91006" id="text4461" @@ -212,7 +212,7 @@ y="213.91006">thread0()</tspan></text> <text xml:space="preserve" - style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol" + style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace" x="252.34512" y="213.91006" id="text4461-6" @@ -223,7 +223,7 @@ y="213.91006">thread1()</tspan></text> <text xml:space="preserve" - style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol" + style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace" x="396.42557" y="213.91006" id="text4461-2" @@ -251,7 +251,7 @@ inkscape:connector-curvature="0" /> <text xml:space="preserve" - style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol" + style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace" x="111.75929" y="251.53981" id="text4429-8" @@ -262,7 +262,7 @@ y="251.53981">rcu_read_lock();</tspan></text> <text xml:space="preserve" - style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol" + style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace" x="396.10254" y="367.91556" id="text4429-8-9" @@ -273,7 +273,7 @@ y="367.91556">rcu_read_lock();</tspan></text> <text xml:space="preserve" - style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol" + style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace" x="396.10254" y="597.40289" id="text4429-8-9-3" @@ -284,7 +284,7 @@ y="597.40289">rcu_read_unlock();</tspan></text> <text xml:space="preserve" - style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol" + style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace" x="111.75929" y="453.15311" id="text4429-8-9-3-1" @@ -300,7 +300,7 @@ inkscape:connector-curvature="0" /> <text xml:space="preserve" - style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol" + style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace" x="394.94427" y="345.66351" id="text4648" @@ -324,7 +324,7 @@ sodipodi:open="true" /> <text xml:space="preserve" - style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol" + style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace" x="112.11968" y="475.77856" id="text4648-4" @@ -361,7 +361,7 @@ sodipodi:open="true" /> <text xml:space="preserve" - style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol" + style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace" x="254.85066" y="348.96619" id="text4648-4-3" diff --git a/Documentation/RCU/Design/Requirements/ReadersPartitionGP1.svg b/Documentation/RCU/Design/Requirements/ReadersPartitionGP1.svg index 48cd1623d4d4..e2a8af592bab 100644 --- a/Documentation/RCU/Design/Requirements/ReadersPartitionGP1.svg +++ b/Documentation/RCU/Design/Requirements/ReadersPartitionGP1.svg @@ -116,7 +116,7 @@ <flowRoot xml:space="preserve" id="flowRoot2985" - style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol"><flowRegion + style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace"><flowRegion id="flowRegion2987"><rect id="rect2989" width="82.85714" @@ -131,7 +131,7 @@ id="text2993" y="-261.66608" x="436.12299" - style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol" + style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace" xml:space="preserve" transform="matrix(0,1,-1,0,0,0)"><tspan y="-261.66608" @@ -163,7 +163,7 @@ </g> <text xml:space="preserve" - style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol" + style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace" x="112.04738" y="268.18076" id="text4429" @@ -174,7 +174,7 @@ y="268.18076">WRITE_ONCE(a, 1);</tspan></text> <text xml:space="preserve" - style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol" + style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace" x="112.04738" y="487.13766" id="text4441" @@ -185,7 +185,7 @@ y="487.13766">WRITE_ONCE(b, 1);</tspan></text> <text xml:space="preserve" - style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol" + style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace" x="255.60869" y="297.29346" id="text4445" @@ -196,7 +196,7 @@ y="297.29346">r1 = READ_ONCE(a);</tspan></text> <text xml:space="preserve" - style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol" + style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace" x="255.14423" y="554.61786" id="text4449" @@ -207,7 +207,7 @@ y="554.61786">WRITE_ONCE(c, 1);</tspan></text> <text xml:space="preserve" - style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol" + style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace" x="396.10254" y="370.71124" id="text4453" @@ -218,7 +218,7 @@ y="370.71124">WRITE_ONCE(d, 1);</tspan></text> <text xml:space="preserve" - style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol" + style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace" x="396.10254" y="572.13617" id="text4457" @@ -229,7 +229,7 @@ y="572.13617">r2 = READ_ONCE(c);</tspan></text> <text xml:space="preserve" - style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol" + style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace" x="112.08231" y="213.91006" id="text4461" @@ -240,7 +240,7 @@ y="213.91006">thread0()</tspan></text> <text xml:space="preserve" - style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol" + style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace" x="252.34512" y="213.91006" id="text4461-6" @@ -251,7 +251,7 @@ y="213.91006">thread1()</tspan></text> <text xml:space="preserve" - style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol" + style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace" x="396.42557" y="213.91006" id="text4461-2" @@ -281,7 +281,7 @@ sodipodi:nodetypes="cc" /> <text xml:space="preserve" - style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol" + style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace" x="111.75929" y="251.53981" id="text4429-8" @@ -292,7 +292,7 @@ y="251.53981">rcu_read_lock();</tspan></text> <text xml:space="preserve" - style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol" + style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace" x="396.10254" y="353.91556" id="text4429-8-9" @@ -303,7 +303,7 @@ y="353.91556">rcu_read_lock();</tspan></text> <text xml:space="preserve" - style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol" + style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace" x="396.10254" y="587.40289" id="text4429-8-9-3" @@ -314,7 +314,7 @@ y="587.40289">rcu_read_unlock();</tspan></text> <text xml:space="preserve" - style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol" + style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace" x="111.75929" y="501.15311" id="text4429-8-9-3-1" @@ -331,7 +331,7 @@ sodipodi:nodetypes="cc" /> <text xml:space="preserve" - style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol" + style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace" x="394.94427" y="331.66351" id="text4648" @@ -355,7 +355,7 @@ sodipodi:open="true" /> <text xml:space="preserve" - style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol" + style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace" x="112.11968" y="523.77856" id="text4648-4" @@ -392,7 +392,7 @@ sodipodi:open="true" /> <text xml:space="preserve" - style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol" + style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace" x="254.85066" y="336.96619" id="text4648-4-3" @@ -421,7 +421,7 @@ id="text2993-7" y="-261.66608" x="440.12299" - style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol" + style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace" xml:space="preserve" transform="matrix(0,1,-1,0,0,0)"><tspan y="-261.66608" @@ -453,7 +453,7 @@ </g> <text xml:space="preserve" - style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol" + style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace" x="541.70508" y="387.6217" id="text4445-0" @@ -464,7 +464,7 @@ y="387.6217">r3 = READ_ONCE(d);</tspan></text> <text xml:space="preserve" - style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol" + style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace" x="541.2406" y="646.94611" id="text4449-6" @@ -488,7 +488,7 @@ sodipodi:open="true" /> <text xml:space="preserve" - style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol" + style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace" x="540.94702" y="427.29443" id="text4648-4-3-1" @@ -499,7 +499,7 @@ y="427.29443">QS</tspan></text> <text xml:space="preserve" - style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol" + style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace" x="686.27747" y="461.83929" id="text4453-7" @@ -510,7 +510,7 @@ y="461.83929">r4 = READ_ONCE(b);</tspan></text> <text xml:space="preserve" - style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol" + style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace" x="686.27747" y="669.26422" id="text4457-9" @@ -521,7 +521,7 @@ y="669.26422">r5 = READ_ONCE(e);</tspan></text> <text xml:space="preserve" - style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol" + style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace" x="686.27747" y="445.04358" id="text4429-8-9-33" @@ -532,7 +532,7 @@ y="445.04358">rcu_read_lock();</tspan></text> <text xml:space="preserve" - style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol" + style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace" x="686.27747" y="684.53094" id="text4429-8-9-3-8" @@ -543,7 +543,7 @@ y="684.53094">rcu_read_unlock();</tspan></text> <text xml:space="preserve" - style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol" + style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace" x="685.11914" y="422.79153" id="text4648-9" @@ -567,7 +567,7 @@ sodipodi:open="true" /> <text xml:space="preserve" - style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol" + style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace" x="397.85934" y="609.59003" id="text4648-5" @@ -591,7 +591,7 @@ sodipodi:open="true" /> <text xml:space="preserve" - style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol" + style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace" x="256.75986" y="586.99133" id="text4648-5-2" @@ -615,7 +615,7 @@ sodipodi:open="true" /> <text xml:space="preserve" - style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol" + style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace" x="546.22791" y="213.91006" id="text4461-2-5" @@ -626,7 +626,7 @@ y="213.91006">thread3()</tspan></text> <text xml:space="preserve" - style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol" + style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace" x="684.00067" y="213.91006" id="text4461-2-1" diff --git a/Documentation/RCU/Design/Requirements/Requirements.html b/Documentation/RCU/Design/Requirements/Requirements.html deleted file mode 100644 index 5a9238a2883c..000000000000 --- a/Documentation/RCU/Design/Requirements/Requirements.html +++ /dev/null @@ -1,3330 +0,0 @@ -<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" - "http://www.w3.org/TR/html4/loose.dtd"> - <html> - <head><title>A Tour Through RCU's Requirements [LWN.net]</title> - <meta HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=utf-8"> - -<h1>A Tour Through RCU's Requirements</h1> - -<p>Copyright IBM Corporation, 2015</p> -<p>Author: Paul E. McKenney</p> -<p><i>The initial version of this document appeared in the -<a href="https://lwn.net/">LWN</a> articles -<a href="https://lwn.net/Articles/652156/">here</a>, -<a href="https://lwn.net/Articles/652677/">here</a>, and -<a href="https://lwn.net/Articles/653326/">here</a>.</i></p> - -<h2>Introduction</h2> - -<p> -Read-copy update (RCU) is a synchronization mechanism that is often -used as a replacement for reader-writer locking. -RCU is unusual in that updaters do not block readers, -which means that RCU's read-side primitives can be exceedingly fast -and scalable. -In addition, updaters can make useful forward progress concurrently -with readers. -However, all this concurrency between RCU readers and updaters does raise -the question of exactly what RCU readers are doing, which in turn -raises the question of exactly what RCU's requirements are. - -<p> -This document therefore summarizes RCU's requirements, and can be thought -of as an informal, high-level specification for RCU. -It is important to understand that RCU's specification is primarily -empirical in nature; -in fact, I learned about many of these requirements the hard way. -This situation might cause some consternation, however, not only -has this learning process been a lot of fun, but it has also been -a great privilege to work with so many people willing to apply -technologies in interesting new ways. - -<p> -All that aside, here are the categories of currently known RCU requirements: -</p> - -<ol> -<li> <a href="#Fundamental Requirements"> - Fundamental Requirements</a> -<li> <a href="#Fundamental Non-Requirements">Fundamental Non-Requirements</a> -<li> <a href="#Parallelism Facts of Life"> - Parallelism Facts of Life</a> -<li> <a href="#Quality-of-Implementation Requirements"> - Quality-of-Implementation Requirements</a> -<li> <a href="#Linux Kernel Complications"> - Linux Kernel Complications</a> -<li> <a href="#Software-Engineering Requirements"> - Software-Engineering Requirements</a> -<li> <a href="#Other RCU Flavors"> - Other RCU Flavors</a> -<li> <a href="#Possible Future Changes"> - Possible Future Changes</a> -</ol> - -<p> -This is followed by a <a href="#Summary">summary</a>, -however, the answers to each quick quiz immediately follows the quiz. -Select the big white space with your mouse to see the answer. - -<h2><a name="Fundamental Requirements">Fundamental Requirements</a></h2> - -<p> -RCU's fundamental requirements are the closest thing RCU has to hard -mathematical requirements. -These are: - -<ol> -<li> <a href="#Grace-Period Guarantee"> - Grace-Period Guarantee</a> -<li> <a href="#Publish-Subscribe Guarantee"> - Publish-Subscribe Guarantee</a> -<li> <a href="#Memory-Barrier Guarantees"> - Memory-Barrier Guarantees</a> -<li> <a href="#RCU Primitives Guaranteed to Execute Unconditionally"> - RCU Primitives Guaranteed to Execute Unconditionally</a> -<li> <a href="#Guaranteed Read-to-Write Upgrade"> - Guaranteed Read-to-Write Upgrade</a> -</ol> - -<h3><a name="Grace-Period Guarantee">Grace-Period Guarantee</a></h3> - -<p> -RCU's grace-period guarantee is unusual in being premeditated: -Jack Slingwine and I had this guarantee firmly in mind when we started -work on RCU (then called “rclock”) in the early 1990s. -That said, the past two decades of experience with RCU have produced -a much more detailed understanding of this guarantee. - -<p> -RCU's grace-period guarantee allows updaters to wait for the completion -of all pre-existing RCU read-side critical sections. -An RCU read-side critical section -begins with the marker <tt>rcu_read_lock()</tt> and ends with -the marker <tt>rcu_read_unlock()</tt>. -These markers may be nested, and RCU treats a nested set as one -big RCU read-side critical section. -Production-quality implementations of <tt>rcu_read_lock()</tt> and -<tt>rcu_read_unlock()</tt> are extremely lightweight, and in -fact have exactly zero overhead in Linux kernels built for production -use with <tt>CONFIG_PREEMPT=n</tt>. - -<p> -This guarantee allows ordering to be enforced with extremely low -overhead to readers, for example: - -<blockquote> -<pre> - 1 int x, y; - 2 - 3 void thread0(void) - 4 { - 5 rcu_read_lock(); - 6 r1 = READ_ONCE(x); - 7 r2 = READ_ONCE(y); - 8 rcu_read_unlock(); - 9 } -10 -11 void thread1(void) -12 { -13 WRITE_ONCE(x, 1); -14 synchronize_rcu(); -15 WRITE_ONCE(y, 1); -16 } -</pre> -</blockquote> - -<p> -Because the <tt>synchronize_rcu()</tt> on line 14 waits for -all pre-existing readers, any instance of <tt>thread0()</tt> that -loads a value of zero from <tt>x</tt> must complete before -<tt>thread1()</tt> stores to <tt>y</tt>, so that instance must -also load a value of zero from <tt>y</tt>. -Similarly, any instance of <tt>thread0()</tt> that loads a value of -one from <tt>y</tt> must have started after the -<tt>synchronize_rcu()</tt> started, and must therefore also load -a value of one from <tt>x</tt>. -Therefore, the outcome: -<blockquote> -<pre> -(r1 == 0 && r2 == 1) -</pre> -</blockquote> -cannot happen. - -<table> -<tr><th> </th></tr> -<tr><th align="left">Quick Quiz:</th></tr> -<tr><td> - Wait a minute! - You said that updaters can make useful forward progress concurrently - with readers, but pre-existing readers will block - <tt>synchronize_rcu()</tt>!!! - Just who are you trying to fool??? -</td></tr> -<tr><th align="left">Answer:</th></tr> -<tr><td bgcolor="#ffffff"><font color="ffffff"> - First, if updaters do not wish to be blocked by readers, they can use - <tt>call_rcu()</tt> or <tt>kfree_rcu()</tt>, which will - be discussed later. - Second, even when using <tt>synchronize_rcu()</tt>, the other - update-side code does run concurrently with readers, whether - pre-existing or not. -</font></td></tr> -<tr><td> </td></tr> -</table> - -<p> -This scenario resembles one of the first uses of RCU in -<a href="https://en.wikipedia.org/wiki/DYNIX">DYNIX/ptx</a>, -which managed a distributed lock manager's transition into -a state suitable for handling recovery from node failure, -more or less as follows: - -<blockquote> -<pre> - 1 #define STATE_NORMAL 0 - 2 #define STATE_WANT_RECOVERY 1 - 3 #define STATE_RECOVERING 2 - 4 #define STATE_WANT_NORMAL 3 - 5 - 6 int state = STATE_NORMAL; - 7 - 8 void do_something_dlm(void) - 9 { -10 int state_snap; -11 -12 rcu_read_lock(); -13 state_snap = READ_ONCE(state); -14 if (state_snap == STATE_NORMAL) -15 do_something(); -16 else -17 do_something_carefully(); -18 rcu_read_unlock(); -19 } -20 -21 void start_recovery(void) -22 { -23 WRITE_ONCE(state, STATE_WANT_RECOVERY); -24 synchronize_rcu(); -25 WRITE_ONCE(state, STATE_RECOVERING); -26 recovery(); -27 WRITE_ONCE(state, STATE_WANT_NORMAL); -28 synchronize_rcu(); -29 WRITE_ONCE(state, STATE_NORMAL); -30 } -</pre> -</blockquote> - -<p> -The RCU read-side critical section in <tt>do_something_dlm()</tt> -works with the <tt>synchronize_rcu()</tt> in <tt>start_recovery()</tt> -to guarantee that <tt>do_something()</tt> never runs concurrently -with <tt>recovery()</tt>, but with little or no synchronization -overhead in <tt>do_something_dlm()</tt>. - -<table> -<tr><th> </th></tr> -<tr><th align="left">Quick Quiz:</th></tr> -<tr><td> - Why is the <tt>synchronize_rcu()</tt> on line 28 needed? -</td></tr> -<tr><th align="left">Answer:</th></tr> -<tr><td bgcolor="#ffffff"><font color="ffffff"> - Without that extra grace period, memory reordering could result in - <tt>do_something_dlm()</tt> executing <tt>do_something()</tt> - concurrently with the last bits of <tt>recovery()</tt>. -</font></td></tr> -<tr><td> </td></tr> -</table> - -<p> -In order to avoid fatal problems such as deadlocks, -an RCU read-side critical section must not contain calls to -<tt>synchronize_rcu()</tt>. -Similarly, an RCU read-side critical section must not -contain anything that waits, directly or indirectly, on completion of -an invocation of <tt>synchronize_rcu()</tt>. - -<p> -Although RCU's grace-period guarantee is useful in and of itself, with -<a href="https://lwn.net/Articles/573497/">quite a few use cases</a>, -it would be good to be able to use RCU to coordinate read-side -access to linked data structures. -For this, the grace-period guarantee is not sufficient, as can -be seen in function <tt>add_gp_buggy()</tt> below. -We will look at the reader's code later, but in the meantime, just think of -the reader as locklessly picking up the <tt>gp</tt> pointer, -and, if the value loaded is non-<tt>NULL</tt>, locklessly accessing the -<tt>->a</tt> and <tt>->b</tt> fields. - -<blockquote> -<pre> - 1 bool add_gp_buggy(int a, int b) - 2 { - 3 p = kmalloc(sizeof(*p), GFP_KERNEL); - 4 if (!p) - 5 return -ENOMEM; - 6 spin_lock(&gp_lock); - 7 if (rcu_access_pointer(gp)) { - 8 spin_unlock(&gp_lock); - 9 return false; -10 } -11 p->a = a; -12 p->b = a; -13 gp = p; /* ORDERING BUG */ -14 spin_unlock(&gp_lock); -15 return true; -16 } -</pre> -</blockquote> - -<p> -The problem is that both the compiler and weakly ordered CPUs are within -their rights to reorder this code as follows: - -<blockquote> -<pre> - 1 bool add_gp_buggy_optimized(int a, int b) - 2 { - 3 p = kmalloc(sizeof(*p), GFP_KERNEL); - 4 if (!p) - 5 return -ENOMEM; - 6 spin_lock(&gp_lock); - 7 if (rcu_access_pointer(gp)) { - 8 spin_unlock(&gp_lock); - 9 return false; -10 } -<b>11 gp = p; /* ORDERING BUG */ -12 p->a = a; -13 p->b = a;</b> -14 spin_unlock(&gp_lock); -15 return true; -16 } -</pre> -</blockquote> - -<p> -If an RCU reader fetches <tt>gp</tt> just after -<tt>add_gp_buggy_optimized</tt> executes line 11, -it will see garbage in the <tt>->a</tt> and <tt>->b</tt> -fields. -And this is but one of many ways in which compiler and hardware optimizations -could cause trouble. -Therefore, we clearly need some way to prevent the compiler and the CPU from -reordering in this manner, which brings us to the publish-subscribe -guarantee discussed in the next section. - -<h3><a name="Publish-Subscribe Guarantee">Publish/Subscribe Guarantee</a></h3> - -<p> -RCU's publish-subscribe guarantee allows data to be inserted -into a linked data structure without disrupting RCU readers. -The updater uses <tt>rcu_assign_pointer()</tt> to insert the -new data, and readers use <tt>rcu_dereference()</tt> to -access data, whether new or old. -The following shows an example of insertion: - -<blockquote> -<pre> - 1 bool add_gp(int a, int b) - 2 { - 3 p = kmalloc(sizeof(*p), GFP_KERNEL); - 4 if (!p) - 5 return -ENOMEM; - 6 spin_lock(&gp_lock); - 7 if (rcu_access_pointer(gp)) { - 8 spin_unlock(&gp_lock); - 9 return false; -10 } -11 p->a = a; -12 p->b = a; -13 rcu_assign_pointer(gp, p); -14 spin_unlock(&gp_lock); -15 return true; -16 } -</pre> -</blockquote> - -<p> -The <tt>rcu_assign_pointer()</tt> on line 13 is conceptually -equivalent to a simple assignment statement, but also guarantees -that its assignment will -happen after the two assignments in lines 11 and 12, -similar to the C11 <tt>memory_order_release</tt> store operation. -It also prevents any number of “interesting” compiler -optimizations, for example, the use of <tt>gp</tt> as a scratch -location immediately preceding the assignment. - -<table> -<tr><th> </th></tr> -<tr><th align="left">Quick Quiz:</th></tr> -<tr><td> - But <tt>rcu_assign_pointer()</tt> does nothing to prevent the - two assignments to <tt>p->a</tt> and <tt>p->b</tt> - from being reordered. - Can't that also cause problems? -</td></tr> -<tr><th align="left">Answer:</th></tr> -<tr><td bgcolor="#ffffff"><font color="ffffff"> - No, it cannot. - The readers cannot see either of these two fields until - the assignment to <tt>gp</tt>, by which time both fields are - fully initialized. - So reordering the assignments - to <tt>p->a</tt> and <tt>p->b</tt> cannot possibly - cause any problems. -</font></td></tr> -<tr><td> </td></tr> -</table> - -<p> -It is tempting to assume that the reader need not do anything special -to control its accesses to the RCU-protected data, -as shown in <tt>do_something_gp_buggy()</tt> below: - -<blockquote> -<pre> - 1 bool do_something_gp_buggy(void) - 2 { - 3 rcu_read_lock(); - 4 p = gp; /* OPTIMIZATIONS GALORE!!! */ - 5 if (p) { - 6 do_something(p->a, p->b); - 7 rcu_read_unlock(); - 8 return true; - 9 } -10 rcu_read_unlock(); -11 return false; -12 } -</pre> -</blockquote> - -<p> -However, this temptation must be resisted because there are a -surprisingly large number of ways that the compiler -(to say nothing of -<a href="https://h71000.www7.hp.com/wizard/wiz_2637.html">DEC Alpha CPUs</a>) -can trip this code up. -For but one example, if the compiler were short of registers, it -might choose to refetch from <tt>gp</tt> rather than keeping -a separate copy in <tt>p</tt> as follows: - -<blockquote> -<pre> - 1 bool do_something_gp_buggy_optimized(void) - 2 { - 3 rcu_read_lock(); - 4 if (gp) { /* OPTIMIZATIONS GALORE!!! */ -<b> 5 do_something(gp->a, gp->b);</b> - 6 rcu_read_unlock(); - 7 return true; - 8 } - 9 rcu_read_unlock(); -10 return false; -11 } -</pre> -</blockquote> - -<p> -If this function ran concurrently with a series of updates that -replaced the current structure with a new one, -the fetches of <tt>gp->a</tt> -and <tt>gp->b</tt> might well come from two different structures, -which could cause serious confusion. -To prevent this (and much else besides), <tt>do_something_gp()</tt> uses -<tt>rcu_dereference()</tt> to fetch from <tt>gp</tt>: - -<blockquote> -<pre> - 1 bool do_something_gp(void) - 2 { - 3 rcu_read_lock(); - 4 p = rcu_dereference(gp); - 5 if (p) { - 6 do_something(p->a, p->b); - 7 rcu_read_unlock(); - 8 return true; - 9 } -10 rcu_read_unlock(); -11 return false; -12 } -</pre> -</blockquote> - -<p> -The <tt>rcu_dereference()</tt> uses volatile casts and (for DEC Alpha) -memory barriers in the Linux kernel. -Should a -<a href="http://www.rdrop.com/users/paulmck/RCU/consume.2015.07.13a.pdf">high-quality implementation of C11 <tt>memory_order_consume</tt> [PDF]</a> -ever appear, then <tt>rcu_dereference()</tt> could be implemented -as a <tt>memory_order_consume</tt> load. -Regardless of the exact implementation, a pointer fetched by -<tt>rcu_dereference()</tt> may not be used outside of the -outermost RCU read-side critical section containing that -<tt>rcu_dereference()</tt>, unless protection of -the corresponding data element has been passed from RCU to some -other synchronization mechanism, most commonly locking or -<a href="https://www.kernel.org/doc/Documentation/RCU/rcuref.txt">reference counting</a>. - -<p> -In short, updaters use <tt>rcu_assign_pointer()</tt> and readers -use <tt>rcu_dereference()</tt>, and these two RCU API elements -work together to ensure that readers have a consistent view of -newly added data elements. - -<p> -Of course, it is also necessary to remove elements from RCU-protected -data structures, for example, using the following process: - -<ol> -<li> Remove the data element from the enclosing structure. -<li> Wait for all pre-existing RCU read-side critical sections - to complete (because only pre-existing readers can possibly have - a reference to the newly removed data element). -<li> At this point, only the updater has a reference to the - newly removed data element, so it can safely reclaim - the data element, for example, by passing it to <tt>kfree()</tt>. -</ol> - -This process is implemented by <tt>remove_gp_synchronous()</tt>: - -<blockquote> -<pre> - 1 bool remove_gp_synchronous(void) - 2 { - 3 struct foo *p; - 4 - 5 spin_lock(&gp_lock); - 6 p = rcu_access_pointer(gp); - 7 if (!p) { - 8 spin_unlock(&gp_lock); - 9 return false; -10 } -11 rcu_assign_pointer(gp, NULL); -12 spin_unlock(&gp_lock); -13 synchronize_rcu(); -14 kfree(p); -15 return true; -16 } -</pre> -</blockquote> - -<p> -This function is straightforward, with line 13 waiting for a grace -period before line 14 frees the old data element. -This waiting ensures that readers will reach line 7 of -<tt>do_something_gp()</tt> before the data element referenced by -<tt>p</tt> is freed. -The <tt>rcu_access_pointer()</tt> on line 6 is similar to -<tt>rcu_dereference()</tt>, except that: - -<ol> -<li> The value returned by <tt>rcu_access_pointer()</tt> - cannot be dereferenced. - If you want to access the value pointed to as well as - the pointer itself, use <tt>rcu_dereference()</tt> - instead of <tt>rcu_access_pointer()</tt>. -<li> The call to <tt>rcu_access_pointer()</tt> need not be - protected. - In contrast, <tt>rcu_dereference()</tt> must either be - within an RCU read-side critical section or in a code - segment where the pointer cannot change, for example, in - code protected by the corresponding update-side lock. -</ol> - -<table> -<tr><th> </th></tr> -<tr><th align="left">Quick Quiz:</th></tr> -<tr><td> - Without the <tt>rcu_dereference()</tt> or the - <tt>rcu_access_pointer()</tt>, what destructive optimizations - might the compiler make use of? -</td></tr> -<tr><th align="left">Answer:</th></tr> -<tr><td bgcolor="#ffffff"><font color="ffffff"> - Let's start with what happens to <tt>do_something_gp()</tt> - if it fails to use <tt>rcu_dereference()</tt>. - It could reuse a value formerly fetched from this same pointer. - It could also fetch the pointer from <tt>gp</tt> in a byte-at-a-time - manner, resulting in <i>load tearing</i>, in turn resulting a bytewise - mash-up of two distinct pointer values. - It might even use value-speculation optimizations, where it makes - a wrong guess, but by the time it gets around to checking the - value, an update has changed the pointer to match the wrong guess. - Too bad about any dereferences that returned pre-initialization garbage - in the meantime! - </font> - - <p><font color="ffffff"> - For <tt>remove_gp_synchronous()</tt>, as long as all modifications - to <tt>gp</tt> are carried out while holding <tt>gp_lock</tt>, - the above optimizations are harmless. - However, <tt>sparse</tt> will complain if you - define <tt>gp</tt> with <tt>__rcu</tt> and then - access it without using - either <tt>rcu_access_pointer()</tt> or <tt>rcu_dereference()</tt>. -</font></td></tr> -<tr><td> </td></tr> -</table> - -<p> -In short, RCU's publish-subscribe guarantee is provided by the combination -of <tt>rcu_assign_pointer()</tt> and <tt>rcu_dereference()</tt>. -This guarantee allows data elements to be safely added to RCU-protected -linked data structures without disrupting RCU readers. -This guarantee can be used in combination with the grace-period -guarantee to also allow data elements to be removed from RCU-protected -linked data structures, again without disrupting RCU readers. - -<p> -This guarantee was only partially premeditated. -DYNIX/ptx used an explicit memory barrier for publication, but had nothing -resembling <tt>rcu_dereference()</tt> for subscription, nor did it -have anything resembling the <tt>smp_read_barrier_depends()</tt> -that was later subsumed into <tt>rcu_dereference()</tt> and later -still into <tt>READ_ONCE()</tt>. -The need for these operations made itself known quite suddenly at a -late-1990s meeting with the DEC Alpha architects, back in the days when -DEC was still a free-standing company. -It took the Alpha architects a good hour to convince me that any sort -of barrier would ever be needed, and it then took me a good <i>two</i> hours -to convince them that their documentation did not make this point clear. -More recent work with the C and C++ standards committees have provided -much education on tricks and traps from the compiler. -In short, compilers were much less tricky in the early 1990s, but in -2015, don't even think about omitting <tt>rcu_dereference()</tt>! - -<h3><a name="Memory-Barrier Guarantees">Memory-Barrier Guarantees</a></h3> - -<p> -The previous section's simple linked-data-structure scenario clearly -demonstrates the need for RCU's stringent memory-ordering guarantees on -systems with more than one CPU: - -<ol> -<li> Each CPU that has an RCU read-side critical section that - begins before <tt>synchronize_rcu()</tt> starts is - guaranteed to execute a full memory barrier between the time - that the RCU read-side critical section ends and the time that - <tt>synchronize_rcu()</tt> returns. - Without this guarantee, a pre-existing RCU read-side critical section - might hold a reference to the newly removed <tt>struct foo</tt> - after the <tt>kfree()</tt> on line 14 of - <tt>remove_gp_synchronous()</tt>. -<li> Each CPU that has an RCU read-side critical section that ends - after <tt>synchronize_rcu()</tt> returns is guaranteed - to execute a full memory barrier between the time that - <tt>synchronize_rcu()</tt> begins and the time that the RCU - read-side critical section begins. - Without this guarantee, a later RCU read-side critical section - running after the <tt>kfree()</tt> on line 14 of - <tt>remove_gp_synchronous()</tt> might - later run <tt>do_something_gp()</tt> and find the - newly deleted <tt>struct foo</tt>. -<li> If the task invoking <tt>synchronize_rcu()</tt> remains - on a given CPU, then that CPU is guaranteed to execute a full - memory barrier sometime during the execution of - <tt>synchronize_rcu()</tt>. - This guarantee ensures that the <tt>kfree()</tt> on - line 14 of <tt>remove_gp_synchronous()</tt> really does - execute after the removal on line 11. -<li> If the task invoking <tt>synchronize_rcu()</tt> migrates - among a group of CPUs during that invocation, then each of the - CPUs in that group is guaranteed to execute a full memory barrier - sometime during the execution of <tt>synchronize_rcu()</tt>. - This guarantee also ensures that the <tt>kfree()</tt> on - line 14 of <tt>remove_gp_synchronous()</tt> really does - execute after the removal on - line 11, but also in the case where the thread executing the - <tt>synchronize_rcu()</tt> migrates in the meantime. -</ol> - -<table> -<tr><th> </th></tr> -<tr><th align="left">Quick Quiz:</th></tr> -<tr><td> - Given that multiple CPUs can start RCU read-side critical sections - at any time without any ordering whatsoever, how can RCU possibly - tell whether or not a given RCU read-side critical section starts - before a given instance of <tt>synchronize_rcu()</tt>? -</td></tr> -<tr><th align="left">Answer:</th></tr> -<tr><td bgcolor="#ffffff"><font color="ffffff"> - If RCU cannot tell whether or not a given - RCU read-side critical section starts before a - given instance of <tt>synchronize_rcu()</tt>, - then it must assume that the RCU read-side critical section - started first. - In other words, a given instance of <tt>synchronize_rcu()</tt> - can avoid waiting on a given RCU read-side critical section only - if it can prove that <tt>synchronize_rcu()</tt> started first. - </font> - - <p><font color="ffffff"> - A related question is “When <tt>rcu_read_lock()</tt> - doesn't generate any code, why does it matter how it relates - to a grace period?” - The answer is that it is not the relationship of - <tt>rcu_read_lock()</tt> itself that is important, but rather - the relationship of the code within the enclosed RCU read-side - critical section to the code preceding and following the - grace period. - If we take this viewpoint, then a given RCU read-side critical - section begins before a given grace period when some access - preceding the grace period observes the effect of some access - within the critical section, in which case none of the accesses - within the critical section may observe the effects of any - access following the grace period. - </font> - - <p><font color="ffffff"> - As of late 2016, mathematical models of RCU take this - viewpoint, for example, see slides 62 and 63 - of the - <a href="http://www2.rdrop.com/users/paulmck/scalability/paper/LinuxMM.2016.10.04c.LCE.pdf">2016 LinuxCon EU</a> - presentation. -</font></td></tr> -<tr><td> </td></tr> -</table> - -<table> -<tr><th> </th></tr> -<tr><th align="left">Quick Quiz:</th></tr> -<tr><td> - The first and second guarantees require unbelievably strict ordering! - Are all these memory barriers <i> really</i> required? -</td></tr> -<tr><th align="left">Answer:</th></tr> -<tr><td bgcolor="#ffffff"><font color="ffffff"> - Yes, they really are required. - To see why the first guarantee is required, consider the following - sequence of events: - </font> - - <ol> - <li> <font color="ffffff"> - CPU 1: <tt>rcu_read_lock()</tt> - </font> - <li> <font color="ffffff"> - CPU 1: <tt>q = rcu_dereference(gp); - /* Very likely to return p. */</tt> - </font> - <li> <font color="ffffff"> - CPU 0: <tt>list_del_rcu(p);</tt> - </font> - <li> <font color="ffffff"> - CPU 0: <tt>synchronize_rcu()</tt> starts. - </font> - <li> <font color="ffffff"> - CPU 1: <tt>do_something_with(q->a); - /* No smp_mb(), so might happen after kfree(). */</tt> - </font> - <li> <font color="ffffff"> - CPU 1: <tt>rcu_read_unlock()</tt> - </font> - <li> <font color="ffffff"> - CPU 0: <tt>synchronize_rcu()</tt> returns. - </font> - <li> <font color="ffffff"> - CPU 0: <tt>kfree(p);</tt> - </font> - </ol> - - <p><font color="ffffff"> - Therefore, there absolutely must be a full memory barrier between the - end of the RCU read-side critical section and the end of the - grace period. - </font> - - <p><font color="ffffff"> - The sequence of events demonstrating the necessity of the second rule - is roughly similar: - </font> - - <ol> - <li> <font color="ffffff">CPU 0: <tt>list_del_rcu(p);</tt> - </font> - <li> <font color="ffffff">CPU 0: <tt>synchronize_rcu()</tt> starts. - </font> - <li> <font color="ffffff">CPU 1: <tt>rcu_read_lock()</tt> - </font> - <li> <font color="ffffff">CPU 1: <tt>q = rcu_dereference(gp); - /* Might return p if no memory barrier. */</tt> - </font> - <li> <font color="ffffff">CPU 0: <tt>synchronize_rcu()</tt> returns. - </font> - <li> <font color="ffffff">CPU 0: <tt>kfree(p);</tt> - </font> - <li> <font color="ffffff"> - CPU 1: <tt>do_something_with(q->a); /* Boom!!! */</tt> - </font> - <li> <font color="ffffff">CPU 1: <tt>rcu_read_unlock()</tt> - </font> - </ol> - - <p><font color="ffffff"> - And similarly, without a memory barrier between the beginning of the - grace period and the beginning of the RCU read-side critical section, - CPU 1 might end up accessing the freelist. - </font> - - <p><font color="ffffff"> - The “as if” rule of course applies, so that any - implementation that acts as if the appropriate memory barriers - were in place is a correct implementation. - That said, it is much easier to fool yourself into believing - that you have adhered to the as-if rule than it is to actually - adhere to it! -</font></td></tr> -<tr><td> </td></tr> -</table> - -<table> -<tr><th> </th></tr> -<tr><th align="left">Quick Quiz:</th></tr> -<tr><td> - You claim that <tt>rcu_read_lock()</tt> and <tt>rcu_read_unlock()</tt> - generate absolutely no code in some kernel builds. - This means that the compiler might arbitrarily rearrange consecutive - RCU read-side critical sections. - Given such rearrangement, if a given RCU read-side critical section - is done, how can you be sure that all prior RCU read-side critical - sections are done? - Won't the compiler rearrangements make that impossible to determine? -</td></tr> -<tr><th align="left">Answer:</th></tr> -<tr><td bgcolor="#ffffff"><font color="ffffff"> - In cases where <tt>rcu_read_lock()</tt> and <tt>rcu_read_unlock()</tt> - generate absolutely no code, RCU infers quiescent states only at - special locations, for example, within the scheduler. - Because calls to <tt>schedule()</tt> had better prevent calling-code - accesses to shared variables from being rearranged across the call to - <tt>schedule()</tt>, if RCU detects the end of a given RCU read-side - critical section, it will necessarily detect the end of all prior - RCU read-side critical sections, no matter how aggressively the - compiler scrambles the code. - </font> - - <p><font color="ffffff"> - Again, this all assumes that the compiler cannot scramble code across - calls to the scheduler, out of interrupt handlers, into the idle loop, - into user-mode code, and so on. - But if your kernel build allows that sort of scrambling, you have broken - far more than just RCU! -</font></td></tr> -<tr><td> </td></tr> -</table> - -<p> -Note that these memory-barrier requirements do not replace the fundamental -RCU requirement that a grace period wait for all pre-existing readers. -On the contrary, the memory barriers called out in this section must operate in -such a way as to <i>enforce</i> this fundamental requirement. -Of course, different implementations enforce this requirement in different -ways, but enforce it they must. - -<h3><a name="RCU Primitives Guaranteed to Execute Unconditionally">RCU Primitives Guaranteed to Execute Unconditionally</a></h3> - -<p> -The common-case RCU primitives are unconditional. -They are invoked, they do their job, and they return, with no possibility -of error, and no need to retry. -This is a key RCU design philosophy. - -<p> -However, this philosophy is pragmatic rather than pigheaded. -If someone comes up with a good justification for a particular conditional -RCU primitive, it might well be implemented and added. -After all, this guarantee was reverse-engineered, not premeditated. -The unconditional nature of the RCU primitives was initially an -accident of implementation, and later experience with synchronization -primitives with conditional primitives caused me to elevate this -accident to a guarantee. -Therefore, the justification for adding a conditional primitive to -RCU would need to be based on detailed and compelling use cases. - -<h3><a name="Guaranteed Read-to-Write Upgrade">Guaranteed Read-to-Write Upgrade</a></h3> - -<p> -As far as RCU is concerned, it is always possible to carry out an -update within an RCU read-side critical section. -For example, that RCU read-side critical section might search for -a given data element, and then might acquire the update-side -spinlock in order to update that element, all while remaining -in that RCU read-side critical section. -Of course, it is necessary to exit the RCU read-side critical section -before invoking <tt>synchronize_rcu()</tt>, however, this -inconvenience can be avoided through use of the -<tt>call_rcu()</tt> and <tt>kfree_rcu()</tt> API members -described later in this document. - -<table> -<tr><th> </th></tr> -<tr><th align="left">Quick Quiz:</th></tr> -<tr><td> - But how does the upgrade-to-write operation exclude other readers? -</td></tr> -<tr><th align="left">Answer:</th></tr> -<tr><td bgcolor="#ffffff"><font color="ffffff"> - It doesn't, just like normal RCU updates, which also do not exclude - RCU readers. -</font></td></tr> -<tr><td> </td></tr> -</table> - -<p> -This guarantee allows lookup code to be shared between read-side -and update-side code, and was premeditated, appearing in the earliest -DYNIX/ptx RCU documentation. - -<h2><a name="Fundamental Non-Requirements">Fundamental Non-Requirements</a></h2> - -<p> -RCU provides extremely lightweight readers, and its read-side guarantees, -though quite useful, are correspondingly lightweight. -It is therefore all too easy to assume that RCU is guaranteeing more -than it really is. -Of course, the list of things that RCU does not guarantee is infinitely -long, however, the following sections list a few non-guarantees that -have caused confusion. -Except where otherwise noted, these non-guarantees were premeditated. - -<ol> -<li> <a href="#Readers Impose Minimal Ordering"> - Readers Impose Minimal Ordering</a> -<li> <a href="#Readers Do Not Exclude Updaters"> - Readers Do Not Exclude Updaters</a> -<li> <a href="#Updaters Only Wait For Old Readers"> - Updaters Only Wait For Old Readers</a> -<li> <a href="#Grace Periods Don't Partition Read-Side Critical Sections"> - Grace Periods Don't Partition Read-Side Critical Sections</a> -<li> <a href="#Read-Side Critical Sections Don't Partition Grace Periods"> - Read-Side Critical Sections Don't Partition Grace Periods</a> -</ol> - -<h3><a name="Readers Impose Minimal Ordering">Readers Impose Minimal Ordering</a></h3> - -<p> -Reader-side markers such as <tt>rcu_read_lock()</tt> and -<tt>rcu_read_unlock()</tt> provide absolutely no ordering guarantees -except through their interaction with the grace-period APIs such as -<tt>synchronize_rcu()</tt>. -To see this, consider the following pair of threads: - -<blockquote> -<pre> - 1 void thread0(void) - 2 { - 3 rcu_read_lock(); - 4 WRITE_ONCE(x, 1); - 5 rcu_read_unlock(); - 6 rcu_read_lock(); - 7 WRITE_ONCE(y, 1); - 8 rcu_read_unlock(); - 9 } -10 -11 void thread1(void) -12 { -13 rcu_read_lock(); -14 r1 = READ_ONCE(y); -15 rcu_read_unlock(); -16 rcu_read_lock(); -17 r2 = READ_ONCE(x); -18 rcu_read_unlock(); -19 } -</pre> -</blockquote> - -<p> -After <tt>thread0()</tt> and <tt>thread1()</tt> execute -concurrently, it is quite possible to have - -<blockquote> -<pre> -(r1 == 1 && r2 == 0) -</pre> -</blockquote> - -(that is, <tt>y</tt> appears to have been assigned before <tt>x</tt>), -which would not be possible if <tt>rcu_read_lock()</tt> and -<tt>rcu_read_unlock()</tt> had much in the way of ordering -properties. -But they do not, so the CPU is within its rights -to do significant reordering. -This is by design: Any significant ordering constraints would slow down -these fast-path APIs. - -<table> -<tr><th> </th></tr> -<tr><th align="left">Quick Quiz:</th></tr> -<tr><td> - Can't the compiler also reorder this code? -</td></tr> -<tr><th align="left">Answer:</th></tr> -<tr><td bgcolor="#ffffff"><font color="ffffff"> - No, the volatile casts in <tt>READ_ONCE()</tt> and - <tt>WRITE_ONCE()</tt> prevent the compiler from reordering in - this particular case. -</font></td></tr> -<tr><td> </td></tr> -</table> - -<h3><a name="Readers Do Not Exclude Updaters">Readers Do Not Exclude Updaters</a></h3> - -<p> -Neither <tt>rcu_read_lock()</tt> nor <tt>rcu_read_unlock()</tt> -exclude updates. -All they do is to prevent grace periods from ending. -The following example illustrates this: - -<blockquote> -<pre> - 1 void thread0(void) - 2 { - 3 rcu_read_lock(); - 4 r1 = READ_ONCE(y); - 5 if (r1) { - 6 do_something_with_nonzero_x(); - 7 r2 = READ_ONCE(x); - 8 WARN_ON(!r2); /* BUG!!! */ - 9 } -10 rcu_read_unlock(); -11 } -12 -13 void thread1(void) -14 { -15 spin_lock(&my_lock); -16 WRITE_ONCE(x, 1); -17 WRITE_ONCE(y, 1); -18 spin_unlock(&my_lock); -19 } -</pre> -</blockquote> - -<p> -If the <tt>thread0()</tt> function's <tt>rcu_read_lock()</tt> -excluded the <tt>thread1()</tt> function's update, -the <tt>WARN_ON()</tt> could never fire. -But the fact is that <tt>rcu_read_lock()</tt> does not exclude -much of anything aside from subsequent grace periods, of which -<tt>thread1()</tt> has none, so the -<tt>WARN_ON()</tt> can and does fire. - -<h3><a name="Updaters Only Wait For Old Readers">Updaters Only Wait For Old Readers</a></h3> - -<p> -It might be tempting to assume that after <tt>synchronize_rcu()</tt> -completes, there are no readers executing. -This temptation must be avoided because -new readers can start immediately after <tt>synchronize_rcu()</tt> -starts, and <tt>synchronize_rcu()</tt> is under no -obligation to wait for these new readers. - -<table> -<tr><th> </th></tr> -<tr><th align="left">Quick Quiz:</th></tr> -<tr><td> - Suppose that synchronize_rcu() did wait until <i>all</i> - readers had completed instead of waiting only on - pre-existing readers. - For how long would the updater be able to rely on there - being no readers? -</td></tr> -<tr><th align="left">Answer:</th></tr> -<tr><td bgcolor="#ffffff"><font color="ffffff"> - For no time at all. - Even if <tt>synchronize_rcu()</tt> were to wait until - all readers had completed, a new reader might start immediately after - <tt>synchronize_rcu()</tt> completed. - Therefore, the code following - <tt>synchronize_rcu()</tt> can <i>never</i> rely on there being - no readers. -</font></td></tr> -<tr><td> </td></tr> -</table> - -<h3><a name="Grace Periods Don't Partition Read-Side Critical Sections"> -Grace Periods Don't Partition Read-Side Critical Sections</a></h3> - -<p> -It is tempting to assume that if any part of one RCU read-side critical -section precedes a given grace period, and if any part of another RCU -read-side critical section follows that same grace period, then all of -the first RCU read-side critical section must precede all of the second. -However, this just isn't the case: A single grace period does not -partition the set of RCU read-side critical sections. -An example of this situation can be illustrated as follows, where -<tt>x</tt>, <tt>y</tt>, and <tt>z</tt> are initially all zero: - -<blockquote> -<pre> - 1 void thread0(void) - 2 { - 3 rcu_read_lock(); - 4 WRITE_ONCE(a, 1); - 5 WRITE_ONCE(b, 1); - 6 rcu_read_unlock(); - 7 } - 8 - 9 void thread1(void) -10 { -11 r1 = READ_ONCE(a); -12 synchronize_rcu(); -13 WRITE_ONCE(c, 1); -14 } -15 -16 void thread2(void) -17 { -18 rcu_read_lock(); -19 r2 = READ_ONCE(b); -20 r3 = READ_ONCE(c); -21 rcu_read_unlock(); -22 } -</pre> -</blockquote> - -<p> -It turns out that the outcome: - -<blockquote> -<pre> -(r1 == 1 && r2 == 0 && r3 == 1) -</pre> -</blockquote> - -is entirely possible. -The following figure show how this can happen, with each circled -<tt>QS</tt> indicating the point at which RCU recorded a -<i>quiescent state</i> for each thread, that is, a state in which -RCU knows that the thread cannot be in the midst of an RCU read-side -critical section that started before the current grace period: - -<p><img src="GPpartitionReaders1.svg" alt="GPpartitionReaders1.svg" width="60%"></p> - -<p> -If it is necessary to partition RCU read-side critical sections in this -manner, it is necessary to use two grace periods, where the first -grace period is known to end before the second grace period starts: - -<blockquote> -<pre> - 1 void thread0(void) - 2 { - 3 rcu_read_lock(); - 4 WRITE_ONCE(a, 1); - 5 WRITE_ONCE(b, 1); - 6 rcu_read_unlock(); - 7 } - 8 - 9 void thread1(void) -10 { -11 r1 = READ_ONCE(a); -12 synchronize_rcu(); -13 WRITE_ONCE(c, 1); -14 } -15 -16 void thread2(void) -17 { -18 r2 = READ_ONCE(c); -19 synchronize_rcu(); -20 WRITE_ONCE(d, 1); -21 } -22 -23 void thread3(void) -24 { -25 rcu_read_lock(); -26 r3 = READ_ONCE(b); -27 r4 = READ_ONCE(d); -28 rcu_read_unlock(); -29 } -</pre> -</blockquote> - -<p> -Here, if <tt>(r1 == 1)</tt>, then -<tt>thread0()</tt>'s write to <tt>b</tt> must happen -before the end of <tt>thread1()</tt>'s grace period. -If in addition <tt>(r4 == 1)</tt>, then -<tt>thread3()</tt>'s read from <tt>b</tt> must happen -after the beginning of <tt>thread2()</tt>'s grace period. -If it is also the case that <tt>(r2 == 1)</tt>, then the -end of <tt>thread1()</tt>'s grace period must precede the -beginning of <tt>thread2()</tt>'s grace period. -This mean that the two RCU read-side critical sections cannot overlap, -guaranteeing that <tt>(r3 == 1)</tt>. -As a result, the outcome: - -<blockquote> -<pre> -(r1 == 1 && r2 == 1 && r3 == 0 && r4 == 1) -</pre> -</blockquote> - -cannot happen. - -<p> -This non-requirement was also non-premeditated, but became apparent -when studying RCU's interaction with memory ordering. - -<h3><a name="Read-Side Critical Sections Don't Partition Grace Periods"> -Read-Side Critical Sections Don't Partition Grace Periods</a></h3> - -<p> -It is also tempting to assume that if an RCU read-side critical section -happens between a pair of grace periods, then those grace periods cannot -overlap. -However, this temptation leads nowhere good, as can be illustrated by -the following, with all variables initially zero: - -<blockquote> -<pre> - 1 void thread0(void) - 2 { - 3 rcu_read_lock(); - 4 WRITE_ONCE(a, 1); - 5 WRITE_ONCE(b, 1); - 6 rcu_read_unlock(); - 7 } - 8 - 9 void thread1(void) -10 { -11 r1 = READ_ONCE(a); -12 synchronize_rcu(); -13 WRITE_ONCE(c, 1); -14 } -15 -16 void thread2(void) -17 { -18 rcu_read_lock(); -19 WRITE_ONCE(d, 1); -20 r2 = READ_ONCE(c); -21 rcu_read_unlock(); -22 } -23 -24 void thread3(void) -25 { -26 r3 = READ_ONCE(d); -27 synchronize_rcu(); -28 WRITE_ONCE(e, 1); -29 } -30 -31 void thread4(void) -32 { -33 rcu_read_lock(); -34 r4 = READ_ONCE(b); -35 r5 = READ_ONCE(e); -36 rcu_read_unlock(); -37 } -</pre> -</blockquote> - -<p> -In this case, the outcome: - -<blockquote> -<pre> -(r1 == 1 && r2 == 1 && r3 == 1 && r4 == 0 && r5 == 1) -</pre> -</blockquote> - -is entirely possible, as illustrated below: - -<p><img src="ReadersPartitionGP1.svg" alt="ReadersPartitionGP1.svg" width="100%"></p> - -<p> -Again, an RCU read-side critical section can overlap almost all of a -given grace period, just so long as it does not overlap the entire -grace period. -As a result, an RCU read-side critical section cannot partition a pair -of RCU grace periods. - -<table> -<tr><th> </th></tr> -<tr><th align="left">Quick Quiz:</th></tr> -<tr><td> - How long a sequence of grace periods, each separated by an RCU - read-side critical section, would be required to partition the RCU - read-side critical sections at the beginning and end of the chain? -</td></tr> -<tr><th align="left">Answer:</th></tr> -<tr><td bgcolor="#ffffff"><font color="ffffff"> - In theory, an infinite number. - In practice, an unknown number that is sensitive to both implementation - details and timing considerations. - Therefore, even in practice, RCU users must abide by the - theoretical rather than the practical answer. -</font></td></tr> -<tr><td> </td></tr> -</table> - -<h2><a name="Parallelism Facts of Life">Parallelism Facts of Life</a></h2> - -<p> -These parallelism facts of life are by no means specific to RCU, but -the RCU implementation must abide by them. -They therefore bear repeating: - -<ol> -<li> Any CPU or task may be delayed at any time, - and any attempts to avoid these delays by disabling - preemption, interrupts, or whatever are completely futile. - This is most obvious in preemptible user-level - environments and in virtualized environments (where - a given guest OS's VCPUs can be preempted at any time by - the underlying hypervisor), but can also happen in bare-metal - environments due to ECC errors, NMIs, and other hardware - events. - Although a delay of more than about 20 seconds can result - in splats, the RCU implementation is obligated to use - algorithms that can tolerate extremely long delays, but where - “extremely long” is not long enough to allow - wrap-around when incrementing a 64-bit counter. -<li> Both the compiler and the CPU can reorder memory accesses. - Where it matters, RCU must use compiler directives and - memory-barrier instructions to preserve ordering. -<li> Conflicting writes to memory locations in any given cache line - will result in expensive cache misses. - Greater numbers of concurrent writes and more-frequent - concurrent writes will result in more dramatic slowdowns. - RCU is therefore obligated to use algorithms that have - sufficient locality to avoid significant performance and - scalability problems. -<li> As a rough rule of thumb, only one CPU's worth of processing - may be carried out under the protection of any given exclusive - lock. - RCU must therefore use scalable locking designs. -<li> Counters are finite, especially on 32-bit systems. - RCU's use of counters must therefore tolerate counter wrap, - or be designed such that counter wrap would take way more - time than a single system is likely to run. - An uptime of ten years is quite possible, a runtime - of a century much less so. - As an example of the latter, RCU's dyntick-idle nesting counter - allows 54 bits for interrupt nesting level (this counter - is 64 bits even on a 32-bit system). - Overflowing this counter requires 2<sup>54</sup> - half-interrupts on a given CPU without that CPU ever going idle. - If a half-interrupt happened every microsecond, it would take - 570 years of runtime to overflow this counter, which is currently - believed to be an acceptably long time. -<li> Linux systems can have thousands of CPUs running a single - Linux kernel in a single shared-memory environment. - RCU must therefore pay close attention to high-end scalability. -</ol> - -<p> -This last parallelism fact of life means that RCU must pay special -attention to the preceding facts of life. -The idea that Linux might scale to systems with thousands of CPUs would -have been met with some skepticism in the 1990s, but these requirements -would have otherwise have been unsurprising, even in the early 1990s. - -<h2><a name="Quality-of-Implementation Requirements">Quality-of-Implementation Requirements</a></h2> - -<p> -These sections list quality-of-implementation requirements. -Although an RCU implementation that ignores these requirements could -still be used, it would likely be subject to limitations that would -make it inappropriate for industrial-strength production use. -Classes of quality-of-implementation requirements are as follows: - -<ol> -<li> <a href="#Specialization">Specialization</a> -<li> <a href="#Performance and Scalability">Performance and Scalability</a> -<li> <a href="#Forward Progress">Forward Progress</a> -<li> <a href="#Composability">Composability</a> -<li> <a href="#Corner Cases">Corner Cases</a> -</ol> - -<p> -These classes is covered in the following sections. - -<h3><a name="Specialization">Specialization</a></h3> - -<p> -RCU is and always has been intended primarily for read-mostly situations, -which means that RCU's read-side primitives are optimized, often at the -expense of its update-side primitives. -Experience thus far is captured by the following list of situations: - -<ol> -<li> Read-mostly data, where stale and inconsistent data is not - a problem: RCU works great! -<li> Read-mostly data, where data must be consistent: - RCU works well. -<li> Read-write data, where data must be consistent: - RCU <i>might</i> work OK. - Or not. -<li> Write-mostly data, where data must be consistent: - RCU is very unlikely to be the right tool for the job, - with the following exceptions, where RCU can provide: - <ol type=a> - <li> Existence guarantees for update-friendly mechanisms. - <li> Wait-free read-side primitives for real-time use. - </ol> -</ol> - -<p> -This focus on read-mostly situations means that RCU must interoperate -with other synchronization primitives. -For example, the <tt>add_gp()</tt> and <tt>remove_gp_synchronous()</tt> -examples discussed earlier use RCU to protect readers and locking to -coordinate updaters. -However, the need extends much farther, requiring that a variety of -synchronization primitives be legal within RCU read-side critical sections, -including spinlocks, sequence locks, atomic operations, reference -counters, and memory barriers. - -<table> -<tr><th> </th></tr> -<tr><th align="left">Quick Quiz:</th></tr> -<tr><td> - What about sleeping locks? -</td></tr> -<tr><th align="left">Answer:</th></tr> -<tr><td bgcolor="#ffffff"><font color="ffffff"> - These are forbidden within Linux-kernel RCU read-side critical - sections because it is not legal to place a quiescent state - (in this case, voluntary context switch) within an RCU read-side - critical section. - However, sleeping locks may be used within userspace RCU read-side - critical sections, and also within Linux-kernel sleepable RCU - <a href="#Sleepable RCU"><font color="ffffff">(SRCU)</font></a> - read-side critical sections. - In addition, the -rt patchset turns spinlocks into a - sleeping locks so that the corresponding critical sections - can be preempted, which also means that these sleeplockified - spinlocks (but not other sleeping locks!) may be acquire within - -rt-Linux-kernel RCU read-side critical sections. - </font> - - <p><font color="ffffff"> - Note that it <i>is</i> legal for a normal RCU read-side - critical section to conditionally acquire a sleeping locks - (as in <tt>mutex_trylock()</tt>), but only as long as it does - not loop indefinitely attempting to conditionally acquire that - sleeping locks. - The key point is that things like <tt>mutex_trylock()</tt> - either return with the mutex held, or return an error indication if - the mutex was not immediately available. - Either way, <tt>mutex_trylock()</tt> returns immediately without - sleeping. -</font></td></tr> -<tr><td> </td></tr> -</table> - -<p> -It often comes as a surprise that many algorithms do not require a -consistent view of data, but many can function in that mode, -with network routing being the poster child. -Internet routing algorithms take significant time to propagate -updates, so that by the time an update arrives at a given system, -that system has been sending network traffic the wrong way for -a considerable length of time. -Having a few threads continue to send traffic the wrong way for a -few more milliseconds is clearly not a problem: In the worst case, -TCP retransmissions will eventually get the data where it needs to go. -In general, when tracking the state of the universe outside of the -computer, some level of inconsistency must be tolerated due to -speed-of-light delays if nothing else. - -<p> -Furthermore, uncertainty about external state is inherent in many cases. -For example, a pair of veterinarians might use heartbeat to determine -whether or not a given cat was alive. -But how long should they wait after the last heartbeat to decide that -the cat is in fact dead? -Waiting less than 400 milliseconds makes no sense because this would -mean that a relaxed cat would be considered to cycle between death -and life more than 100 times per minute. -Moreover, just as with human beings, a cat's heart might stop for -some period of time, so the exact wait period is a judgment call. -One of our pair of veterinarians might wait 30 seconds before pronouncing -the cat dead, while the other might insist on waiting a full minute. -The two veterinarians would then disagree on the state of the cat during -the final 30 seconds of the minute following the last heartbeat. - -<p> -Interestingly enough, this same situation applies to hardware. -When push comes to shove, how do we tell whether or not some -external server has failed? -We send messages to it periodically, and declare it failed if we -don't receive a response within a given period of time. -Policy decisions can usually tolerate short -periods of inconsistency. -The policy was decided some time ago, and is only now being put into -effect, so a few milliseconds of delay is normally inconsequential. - -<p> -However, there are algorithms that absolutely must see consistent data. -For example, the translation between a user-level SystemV semaphore -ID to the corresponding in-kernel data structure is protected by RCU, -but it is absolutely forbidden to update a semaphore that has just been -removed. -In the Linux kernel, this need for consistency is accommodated by acquiring -spinlocks located in the in-kernel data structure from within -the RCU read-side critical section, and this is indicated by the -green box in the figure above. -Many other techniques may be used, and are in fact used within the -Linux kernel. - -<p> -In short, RCU is not required to maintain consistency, and other -mechanisms may be used in concert with RCU when consistency is required. -RCU's specialization allows it to do its job extremely well, and its -ability to interoperate with other synchronization mechanisms allows -the right mix of synchronization tools to be used for a given job. - -<h3><a name="Performance and Scalability">Performance and Scalability</a></h3> - -<p> -Energy efficiency is a critical component of performance today, -and Linux-kernel RCU implementations must therefore avoid unnecessarily -awakening idle CPUs. -I cannot claim that this requirement was premeditated. -In fact, I learned of it during a telephone conversation in which I -was given “frank and open” feedback on the importance -of energy efficiency in battery-powered systems and on specific -energy-efficiency shortcomings of the Linux-kernel RCU implementation. -In my experience, the battery-powered embedded community will consider -any unnecessary wakeups to be extremely unfriendly acts. -So much so that mere Linux-kernel-mailing-list posts are -insufficient to vent their ire. - -<p> -Memory consumption is not particularly important for in most -situations, and has become decreasingly -so as memory sizes have expanded and memory -costs have plummeted. -However, as I learned from Matt Mackall's -<a href="http://elinux.org/Linux_Tiny-FAQ">bloatwatch</a> -efforts, memory footprint is critically important on single-CPU systems with -non-preemptible (<tt>CONFIG_PREEMPT=n</tt>) kernels, and thus -<a href="https://lkml.kernel.org/g/20090113221724.GA15307@linux.vnet.ibm.com">tiny RCU</a> -was born. -Josh Triplett has since taken over the small-memory banner with his -<a href="https://tiny.wiki.kernel.org/">Linux kernel tinification</a> -project, which resulted in -<a href="#Sleepable RCU">SRCU</a> -becoming optional for those kernels not needing it. - -<p> -The remaining performance requirements are, for the most part, -unsurprising. -For example, in keeping with RCU's read-side specialization, -<tt>rcu_dereference()</tt> should have negligible overhead (for -example, suppression of a few minor compiler optimizations). -Similarly, in non-preemptible environments, <tt>rcu_read_lock()</tt> and -<tt>rcu_read_unlock()</tt> should have exactly zero overhead. - -<p> -In preemptible environments, in the case where the RCU read-side -critical section was not preempted (as will be the case for the -highest-priority real-time process), <tt>rcu_read_lock()</tt> and -<tt>rcu_read_unlock()</tt> should have minimal overhead. -In particular, they should not contain atomic read-modify-write -operations, memory-barrier instructions, preemption disabling, -interrupt disabling, or backwards branches. -However, in the case where the RCU read-side critical section was preempted, -<tt>rcu_read_unlock()</tt> may acquire spinlocks and disable interrupts. -This is why it is better to nest an RCU read-side critical section -within a preempt-disable region than vice versa, at least in cases -where that critical section is short enough to avoid unduly degrading -real-time latencies. - -<p> -The <tt>synchronize_rcu()</tt> grace-period-wait primitive is -optimized for throughput. -It may therefore incur several milliseconds of latency in addition to -the duration of the longest RCU read-side critical section. -On the other hand, multiple concurrent invocations of -<tt>synchronize_rcu()</tt> are required to use batching optimizations -so that they can be satisfied by a single underlying grace-period-wait -operation. -For example, in the Linux kernel, it is not unusual for a single -grace-period-wait operation to serve more than -<a href="https://www.usenix.org/conference/2004-usenix-annual-technical-conference/making-rcu-safe-deep-sub-millisecond-response">1,000 separate invocations</a> -of <tt>synchronize_rcu()</tt>, thus amortizing the per-invocation -overhead down to nearly zero. -However, the grace-period optimization is also required to avoid -measurable degradation of real-time scheduling and interrupt latencies. - -<p> -In some cases, the multi-millisecond <tt>synchronize_rcu()</tt> -latencies are unacceptable. -In these cases, <tt>synchronize_rcu_expedited()</tt> may be used -instead, reducing the grace-period latency down to a few tens of -microseconds on small systems, at least in cases where the RCU read-side -critical sections are short. -There are currently no special latency requirements for -<tt>synchronize_rcu_expedited()</tt> on large systems, but, -consistent with the empirical nature of the RCU specification, -that is subject to change. -However, there most definitely are scalability requirements: -A storm of <tt>synchronize_rcu_expedited()</tt> invocations on 4096 -CPUs should at least make reasonable forward progress. -In return for its shorter latencies, <tt>synchronize_rcu_expedited()</tt> -is permitted to impose modest degradation of real-time latency -on non-idle online CPUs. -Here, “modest” means roughly the same latency -degradation as a scheduling-clock interrupt. - -<p> -There are a number of situations where even -<tt>synchronize_rcu_expedited()</tt>'s reduced grace-period -latency is unacceptable. -In these situations, the asynchronous <tt>call_rcu()</tt> can be -used in place of <tt>synchronize_rcu()</tt> as follows: - -<blockquote> -<pre> - 1 struct foo { - 2 int a; - 3 int b; - 4 struct rcu_head rh; - 5 }; - 6 - 7 static void remove_gp_cb(struct rcu_head *rhp) - 8 { - 9 struct foo *p = container_of(rhp, struct foo, rh); -10 -11 kfree(p); -12 } -13 -14 bool remove_gp_asynchronous(void) -15 { -16 struct foo *p; -17 -18 spin_lock(&gp_lock); -19 p = rcu_access_pointer(gp); -20 if (!p) { -21 spin_unlock(&gp_lock); -22 return false; -23 } -24 rcu_assign_pointer(gp, NULL); -25 call_rcu(&p->rh, remove_gp_cb); -26 spin_unlock(&gp_lock); -27 return true; -28 } -</pre> -</blockquote> - -<p> -A definition of <tt>struct foo</tt> is finally needed, and appears -on lines 1-5. -The function <tt>remove_gp_cb()</tt> is passed to <tt>call_rcu()</tt> -on line 25, and will be invoked after the end of a subsequent -grace period. -This gets the same effect as <tt>remove_gp_synchronous()</tt>, -but without forcing the updater to wait for a grace period to elapse. -The <tt>call_rcu()</tt> function may be used in a number of -situations where neither <tt>synchronize_rcu()</tt> nor -<tt>synchronize_rcu_expedited()</tt> would be legal, -including within preempt-disable code, <tt>local_bh_disable()</tt> code, -interrupt-disable code, and interrupt handlers. -However, even <tt>call_rcu()</tt> is illegal within NMI handlers -and from idle and offline CPUs. -The callback function (<tt>remove_gp_cb()</tt> in this case) will be -executed within softirq (software interrupt) environment within the -Linux kernel, -either within a real softirq handler or under the protection -of <tt>local_bh_disable()</tt>. -In both the Linux kernel and in userspace, it is bad practice to -write an RCU callback function that takes too long. -Long-running operations should be relegated to separate threads or -(in the Linux kernel) workqueues. - -<table> -<tr><th> </th></tr> -<tr><th align="left">Quick Quiz:</th></tr> -<tr><td> - Why does line 19 use <tt>rcu_access_pointer()</tt>? - After all, <tt>call_rcu()</tt> on line 25 stores into the - structure, which would interact badly with concurrent insertions. - Doesn't this mean that <tt>rcu_dereference()</tt> is required? -</td></tr> -<tr><th align="left">Answer:</th></tr> -<tr><td bgcolor="#ffffff"><font color="ffffff"> - Presumably the <tt>->gp_lock</tt> acquired on line 18 excludes - any changes, including any insertions that <tt>rcu_dereference()</tt> - would protect against. - Therefore, any insertions will be delayed until after - <tt>->gp_lock</tt> - is released on line 25, which in turn means that - <tt>rcu_access_pointer()</tt> suffices. -</font></td></tr> -<tr><td> </td></tr> -</table> - -<p> -However, all that <tt>remove_gp_cb()</tt> is doing is -invoking <tt>kfree()</tt> on the data element. -This is a common idiom, and is supported by <tt>kfree_rcu()</tt>, -which allows “fire and forget” operation as shown below: - -<blockquote> -<pre> - 1 struct foo { - 2 int a; - 3 int b; - 4 struct rcu_head rh; - 5 }; - 6 - 7 bool remove_gp_faf(void) - 8 { - 9 struct foo *p; -10 -11 spin_lock(&gp_lock); -12 p = rcu_dereference(gp); -13 if (!p) { -14 spin_unlock(&gp_lock); -15 return false; -16 } -17 rcu_assign_pointer(gp, NULL); -18 kfree_rcu(p, rh); -19 spin_unlock(&gp_lock); -20 return true; -21 } -</pre> -</blockquote> - -<p> -Note that <tt>remove_gp_faf()</tt> simply invokes -<tt>kfree_rcu()</tt> and proceeds, without any need to pay any -further attention to the subsequent grace period and <tt>kfree()</tt>. -It is permissible to invoke <tt>kfree_rcu()</tt> from the same -environments as for <tt>call_rcu()</tt>. -Interestingly enough, DYNIX/ptx had the equivalents of -<tt>call_rcu()</tt> and <tt>kfree_rcu()</tt>, but not -<tt>synchronize_rcu()</tt>. -This was due to the fact that RCU was not heavily used within DYNIX/ptx, -so the very few places that needed something like -<tt>synchronize_rcu()</tt> simply open-coded it. - -<table> -<tr><th> </th></tr> -<tr><th align="left">Quick Quiz:</th></tr> -<tr><td> - Earlier it was claimed that <tt>call_rcu()</tt> and - <tt>kfree_rcu()</tt> allowed updaters to avoid being blocked - by readers. - But how can that be correct, given that the invocation of the callback - and the freeing of the memory (respectively) must still wait for - a grace period to elapse? -</td></tr> -<tr><th align="left">Answer:</th></tr> -<tr><td bgcolor="#ffffff"><font color="ffffff"> - We could define things this way, but keep in mind that this sort of - definition would say that updates in garbage-collected languages - cannot complete until the next time the garbage collector runs, - which does not seem at all reasonable. - The key point is that in most cases, an updater using either - <tt>call_rcu()</tt> or <tt>kfree_rcu()</tt> can proceed to the - next update as soon as it has invoked <tt>call_rcu()</tt> or - <tt>kfree_rcu()</tt>, without having to wait for a subsequent - grace period. -</font></td></tr> -<tr><td> </td></tr> -</table> - -<p> -But what if the updater must wait for the completion of code to be -executed after the end of the grace period, but has other tasks -that can be carried out in the meantime? -The polling-style <tt>get_state_synchronize_rcu()</tt> and -<tt>cond_synchronize_rcu()</tt> functions may be used for this -purpose, as shown below: - -<blockquote> -<pre> - 1 bool remove_gp_poll(void) - 2 { - 3 struct foo *p; - 4 unsigned long s; - 5 - 6 spin_lock(&gp_lock); - 7 p = rcu_access_pointer(gp); - 8 if (!p) { - 9 spin_unlock(&gp_lock); -10 return false; -11 } -12 rcu_assign_pointer(gp, NULL); -13 spin_unlock(&gp_lock); -14 s = get_state_synchronize_rcu(); -15 do_something_while_waiting(); -16 cond_synchronize_rcu(s); -17 kfree(p); -18 return true; -19 } -</pre> -</blockquote> - -<p> -On line 14, <tt>get_state_synchronize_rcu()</tt> obtains a -“cookie” from RCU, -then line 15 carries out other tasks, -and finally, line 16 returns immediately if a grace period has -elapsed in the meantime, but otherwise waits as required. -The need for <tt>get_state_synchronize_rcu</tt> and -<tt>cond_synchronize_rcu()</tt> has appeared quite recently, -so it is too early to tell whether they will stand the test of time. - -<p> -RCU thus provides a range of tools to allow updaters to strike the -required tradeoff between latency, flexibility and CPU overhead. - -<h3><a name="Forward Progress">Forward Progress</a></h3> - -<p> -In theory, delaying grace-period completion and callback invocation -is harmless. -In practice, not only are memory sizes finite but also callbacks sometimes -do wakeups, and sufficiently deferred wakeups can be difficult -to distinguish from system hangs. -Therefore, RCU must provide a number of mechanisms to promote forward -progress. - -<p> -These mechanisms are not foolproof, nor can they be. -For one simple example, an infinite loop in an RCU read-side critical -section must by definition prevent later grace periods from ever completing. -For a more involved example, consider a 64-CPU system built with -<tt>CONFIG_RCU_NOCB_CPU=y</tt> and booted with <tt>rcu_nocbs=1-63</tt>, -where CPUs 1 through 63 spin in tight loops that invoke -<tt>call_rcu()</tt>. -Even if these tight loops also contain calls to <tt>cond_resched()</tt> -(thus allowing grace periods to complete), CPU 0 simply will -not be able to invoke callbacks as fast as the other 63 CPUs can -register them, at least not until the system runs out of memory. -In both of these examples, the Spiderman principle applies: With great -power comes great responsibility. -However, short of this level of abuse, RCU is required to -ensure timely completion of grace periods and timely invocation of -callbacks. - -<p> -RCU takes the following steps to encourage timely completion of -grace periods: - -<ol> -<li> If a grace period fails to complete within 100 milliseconds, - RCU causes future invocations of <tt>cond_resched()</tt> on - the holdout CPUs to provide an RCU quiescent state. - RCU also causes those CPUs' <tt>need_resched()</tt> invocations - to return <tt>true</tt>, but only after the corresponding CPU's - next scheduling-clock. -<li> CPUs mentioned in the <tt>nohz_full</tt> kernel boot parameter - can run indefinitely in the kernel without scheduling-clock - interrupts, which defeats the above <tt>need_resched()</tt> - strategem. - RCU will therefore invoke <tt>resched_cpu()</tt> on any - <tt>nohz_full</tt> CPUs still holding out after - 109 milliseconds. -<li> In kernels built with <tt>CONFIG_RCU_BOOST=y</tt>, if a given - task that has been preempted within an RCU read-side critical - section is holding out for more than 500 milliseconds, - RCU will resort to priority boosting. -<li> If a CPU is still holding out 10 seconds into the grace - period, RCU will invoke <tt>resched_cpu()</tt> on it regardless - of its <tt>nohz_full</tt> state. -</ol> - -<p> -The above values are defaults for systems running with <tt>HZ=1000</tt>. -They will vary as the value of <tt>HZ</tt> varies, and can also be -changed using the relevant Kconfig options and kernel boot parameters. -RCU currently does not do much sanity checking of these -parameters, so please use caution when changing them. -Note that these forward-progress measures are provided only for RCU, -not for -<a href="#Sleepable RCU">SRCU</a> or -<a href="#Tasks RCU">Tasks RCU</a>. - -<p> -RCU takes the following steps in <tt>call_rcu()</tt> to encourage timely -invocation of callbacks when any given non-<tt>rcu_nocbs</tt> CPU has -10,000 callbacks, or has 10,000 more callbacks than it had the last time -encouragement was provided: - -<ol> -<li> Starts a grace period, if one is not already in progress. -<li> Forces immediate checking for quiescent states, rather than - waiting for three milliseconds to have elapsed since the - beginning of the grace period. -<li> Immediately tags the CPU's callbacks with their grace period - completion numbers, rather than waiting for the <tt>RCU_SOFTIRQ</tt> - handler to get around to it. -<li> Lifts callback-execution batch limits, which speeds up callback - invocation at the expense of degrading realtime response. -</ol> - -<p> -Again, these are default values when running at <tt>HZ=1000</tt>, -and can be overridden. -Again, these forward-progress measures are provided only for RCU, -not for -<a href="#Sleepable RCU">SRCU</a> or -<a href="#Tasks RCU">Tasks RCU</a>. -Even for RCU, callback-invocation forward progress for <tt>rcu_nocbs</tt> -CPUs is much less well-developed, in part because workloads benefiting -from <tt>rcu_nocbs</tt> CPUs tend to invoke <tt>call_rcu()</tt> -relatively infrequently. -If workloads emerge that need both <tt>rcu_nocbs</tt> CPUs and high -<tt>call_rcu()</tt> invocation rates, then additional forward-progress -work will be required. - -<h3><a name="Composability">Composability</a></h3> - -<p> -Composability has received much attention in recent years, perhaps in part -due to the collision of multicore hardware with object-oriented techniques -designed in single-threaded environments for single-threaded use. -And in theory, RCU read-side critical sections may be composed, and in -fact may be nested arbitrarily deeply. -In practice, as with all real-world implementations of composable -constructs, there are limitations. - -<p> -Implementations of RCU for which <tt>rcu_read_lock()</tt> -and <tt>rcu_read_unlock()</tt> generate no code, such as -Linux-kernel RCU when <tt>CONFIG_PREEMPT=n</tt>, can be -nested arbitrarily deeply. -After all, there is no overhead. -Except that if all these instances of <tt>rcu_read_lock()</tt> -and <tt>rcu_read_unlock()</tt> are visible to the compiler, -compilation will eventually fail due to exhausting memory, -mass storage, or user patience, whichever comes first. -If the nesting is not visible to the compiler, as is the case with -mutually recursive functions each in its own translation unit, -stack overflow will result. -If the nesting takes the form of loops, perhaps in the guise of tail -recursion, either the control variable -will overflow or (in the Linux kernel) you will get an RCU CPU stall warning. -Nevertheless, this class of RCU implementations is one -of the most composable constructs in existence. - -<p> -RCU implementations that explicitly track nesting depth -are limited by the nesting-depth counter. -For example, the Linux kernel's preemptible RCU limits nesting to -<tt>INT_MAX</tt>. -This should suffice for almost all practical purposes. -That said, a consecutive pair of RCU read-side critical sections -between which there is an operation that waits for a grace period -cannot be enclosed in another RCU read-side critical section. -This is because it is not legal to wait for a grace period within -an RCU read-side critical section: To do so would result either -in deadlock or -in RCU implicitly splitting the enclosing RCU read-side critical -section, neither of which is conducive to a long-lived and prosperous -kernel. - -<p> -It is worth noting that RCU is not alone in limiting composability. -For example, many transactional-memory implementations prohibit -composing a pair of transactions separated by an irrevocable -operation (for example, a network receive operation). -For another example, lock-based critical sections can be composed -surprisingly freely, but only if deadlock is avoided. - -<p> -In short, although RCU read-side critical sections are highly composable, -care is required in some situations, just as is the case for any other -composable synchronization mechanism. - -<h3><a name="Corner Cases">Corner Cases</a></h3> - -<p> -A given RCU workload might have an endless and intense stream of -RCU read-side critical sections, perhaps even so intense that there -was never a point in time during which there was not at least one -RCU read-side critical section in flight. -RCU cannot allow this situation to block grace periods: As long as -all the RCU read-side critical sections are finite, grace periods -must also be finite. - -<p> -That said, preemptible RCU implementations could potentially result -in RCU read-side critical sections being preempted for long durations, -which has the effect of creating a long-duration RCU read-side -critical section. -This situation can arise only in heavily loaded systems, but systems using -real-time priorities are of course more vulnerable. -Therefore, RCU priority boosting is provided to help deal with this -case. -That said, the exact requirements on RCU priority boosting will likely -evolve as more experience accumulates. - -<p> -Other workloads might have very high update rates. -Although one can argue that such workloads should instead use -something other than RCU, the fact remains that RCU must -handle such workloads gracefully. -This requirement is another factor driving batching of grace periods, -but it is also the driving force behind the checks for large numbers -of queued RCU callbacks in the <tt>call_rcu()</tt> code path. -Finally, high update rates should not delay RCU read-side critical -sections, although some small read-side delays can occur when using -<tt>synchronize_rcu_expedited()</tt>, courtesy of this function's use -of <tt>smp_call_function_single()</tt>. - -<p> -Although all three of these corner cases were understood in the early -1990s, a simple user-level test consisting of <tt>close(open(path))</tt> -in a tight loop -in the early 2000s suddenly provided a much deeper appreciation of the -high-update-rate corner case. -This test also motivated addition of some RCU code to react to high update -rates, for example, if a given CPU finds itself with more than 10,000 -RCU callbacks queued, it will cause RCU to take evasive action by -more aggressively starting grace periods and more aggressively forcing -completion of grace-period processing. -This evasive action causes the grace period to complete more quickly, -but at the cost of restricting RCU's batching optimizations, thus -increasing the CPU overhead incurred by that grace period. - -<h2><a name="Software-Engineering Requirements"> -Software-Engineering Requirements</a></h2> - -<p> -Between Murphy's Law and “To err is human”, it is necessary to -guard against mishaps and misuse: - -<ol> -<li> It is all too easy to forget to use <tt>rcu_read_lock()</tt> - everywhere that it is needed, so kernels built with - <tt>CONFIG_PROVE_RCU=y</tt> will splat if - <tt>rcu_dereference()</tt> is used outside of an - RCU read-side critical section. - Update-side code can use <tt>rcu_dereference_protected()</tt>, - which takes a - <a href="https://lwn.net/Articles/371986/">lockdep expression</a> - to indicate what is providing the protection. - If the indicated protection is not provided, a lockdep splat - is emitted. - - <p> - Code shared between readers and updaters can use - <tt>rcu_dereference_check()</tt>, which also takes a - lockdep expression, and emits a lockdep splat if neither - <tt>rcu_read_lock()</tt> nor the indicated protection - is in place. - In addition, <tt>rcu_dereference_raw()</tt> is used in those - (hopefully rare) cases where the required protection cannot - be easily described. - Finally, <tt>rcu_read_lock_held()</tt> is provided to - allow a function to verify that it has been invoked within - an RCU read-side critical section. - I was made aware of this set of requirements shortly after Thomas - Gleixner audited a number of RCU uses. -<li> A given function might wish to check for RCU-related preconditions - upon entry, before using any other RCU API. - The <tt>rcu_lockdep_assert()</tt> does this job, - asserting the expression in kernels having lockdep enabled - and doing nothing otherwise. -<li> It is also easy to forget to use <tt>rcu_assign_pointer()</tt> - and <tt>rcu_dereference()</tt>, perhaps (incorrectly) - substituting a simple assignment. - To catch this sort of error, a given RCU-protected pointer may be - tagged with <tt>__rcu</tt>, after which sparse - will complain about simple-assignment accesses to that pointer. - Arnd Bergmann made me aware of this requirement, and also - supplied the needed - <a href="https://lwn.net/Articles/376011/">patch series</a>. -<li> Kernels built with <tt>CONFIG_DEBUG_OBJECTS_RCU_HEAD=y</tt> - will splat if a data element is passed to <tt>call_rcu()</tt> - twice in a row, without a grace period in between. - (This error is similar to a double free.) - The corresponding <tt>rcu_head</tt> structures that are - dynamically allocated are automatically tracked, but - <tt>rcu_head</tt> structures allocated on the stack - must be initialized with <tt>init_rcu_head_on_stack()</tt> - and cleaned up with <tt>destroy_rcu_head_on_stack()</tt>. - Similarly, statically allocated non-stack <tt>rcu_head</tt> - structures must be initialized with <tt>init_rcu_head()</tt> - and cleaned up with <tt>destroy_rcu_head()</tt>. - Mathieu Desnoyers made me aware of this requirement, and also - supplied the needed - <a href="https://lkml.kernel.org/g/20100319013024.GA28456@Krystal">patch</a>. -<li> An infinite loop in an RCU read-side critical section will - eventually trigger an RCU CPU stall warning splat, with - the duration of “eventually” being controlled by the - <tt>RCU_CPU_STALL_TIMEOUT</tt> <tt>Kconfig</tt> option, or, - alternatively, by the - <tt>rcupdate.rcu_cpu_stall_timeout</tt> boot/sysfs - parameter. - However, RCU is not obligated to produce this splat - unless there is a grace period waiting on that particular - RCU read-side critical section. - <p> - Some extreme workloads might intentionally delay - RCU grace periods, and systems running those workloads can - be booted with <tt>rcupdate.rcu_cpu_stall_suppress</tt> - to suppress the splats. - This kernel parameter may also be set via <tt>sysfs</tt>. - Furthermore, RCU CPU stall warnings are counter-productive - during sysrq dumps and during panics. - RCU therefore supplies the <tt>rcu_sysrq_start()</tt> and - <tt>rcu_sysrq_end()</tt> API members to be called before - and after long sysrq dumps. - RCU also supplies the <tt>rcu_panic()</tt> notifier that is - automatically invoked at the beginning of a panic to suppress - further RCU CPU stall warnings. - - <p> - This requirement made itself known in the early 1990s, pretty - much the first time that it was necessary to debug a CPU stall. - That said, the initial implementation in DYNIX/ptx was quite - generic in comparison with that of Linux. -<li> Although it would be very good to detect pointers leaking out - of RCU read-side critical sections, there is currently no - good way of doing this. - One complication is the need to distinguish between pointers - leaking and pointers that have been handed off from RCU to - some other synchronization mechanism, for example, reference - counting. -<li> In kernels built with <tt>CONFIG_RCU_TRACE=y</tt>, RCU-related - information is provided via event tracing. -<li> Open-coded use of <tt>rcu_assign_pointer()</tt> and - <tt>rcu_dereference()</tt> to create typical linked - data structures can be surprisingly error-prone. - Therefore, RCU-protected - <a href="https://lwn.net/Articles/609973/#RCU List APIs">linked lists</a> - and, more recently, RCU-protected - <a href="https://lwn.net/Articles/612100/">hash tables</a> - are available. - Many other special-purpose RCU-protected data structures are - available in the Linux kernel and the userspace RCU library. -<li> Some linked structures are created at compile time, but still - require <tt>__rcu</tt> checking. - The <tt>RCU_POINTER_INITIALIZER()</tt> macro serves this - purpose. -<li> It is not necessary to use <tt>rcu_assign_pointer()</tt> - when creating linked structures that are to be published via - a single external pointer. - The <tt>RCU_INIT_POINTER()</tt> macro is provided for - this task and also for assigning <tt>NULL</tt> pointers - at runtime. -</ol> - -<p> -This not a hard-and-fast list: RCU's diagnostic capabilities will -continue to be guided by the number and type of usage bugs found -in real-world RCU usage. - -<h2><a name="Linux Kernel Complications">Linux Kernel Complications</a></h2> - -<p> -The Linux kernel provides an interesting environment for all kinds of -software, including RCU. -Some of the relevant points of interest are as follows: - -<ol> -<li> <a href="#Configuration">Configuration</a>. -<li> <a href="#Firmware Interface">Firmware Interface</a>. -<li> <a href="#Early Boot">Early Boot</a>. -<li> <a href="#Interrupts and NMIs"> - Interrupts and non-maskable interrupts (NMIs)</a>. -<li> <a href="#Loadable Modules">Loadable Modules</a>. -<li> <a href="#Hotplug CPU">Hotplug CPU</a>. -<li> <a href="#Scheduler and RCU">Scheduler and RCU</a>. -<li> <a href="#Tracing and RCU">Tracing and RCU</a>. -<li> <a href="#Energy Efficiency">Energy Efficiency</a>. -<li> <a href="#Scheduling-Clock Interrupts and RCU"> - Scheduling-Clock Interrupts and RCU</a>. -<li> <a href="#Memory Efficiency">Memory Efficiency</a>. -<li> <a href="#Performance, Scalability, Response Time, and Reliability"> - Performance, Scalability, Response Time, and Reliability</a>. -</ol> - -<p> -This list is probably incomplete, but it does give a feel for the -most notable Linux-kernel complications. -Each of the following sections covers one of the above topics. - -<h3><a name="Configuration">Configuration</a></h3> - -<p> -RCU's goal is automatic configuration, so that almost nobody -needs to worry about RCU's <tt>Kconfig</tt> options. -And for almost all users, RCU does in fact work well -“out of the box.” - -<p> -However, there are specialized use cases that are handled by -kernel boot parameters and <tt>Kconfig</tt> options. -Unfortunately, the <tt>Kconfig</tt> system will explicitly ask users -about new <tt>Kconfig</tt> options, which requires almost all of them -be hidden behind a <tt>CONFIG_RCU_EXPERT</tt> <tt>Kconfig</tt> option. - -<p> -This all should be quite obvious, but the fact remains that -Linus Torvalds recently had to -<a href="https://lkml.kernel.org/g/CA+55aFy4wcCwaL4okTs8wXhGZ5h-ibecy_Meg9C4MNQrUnwMcg@mail.gmail.com">remind</a> -me of this requirement. - -<h3><a name="Firmware Interface">Firmware Interface</a></h3> - -<p> -In many cases, kernel obtains information about the system from the -firmware, and sometimes things are lost in translation. -Or the translation is accurate, but the original message is bogus. - -<p> -For example, some systems' firmware overreports the number of CPUs, -sometimes by a large factor. -If RCU naively believed the firmware, as it used to do, -it would create too many per-CPU kthreads. -Although the resulting system will still run correctly, the extra -kthreads needlessly consume memory and can cause confusion -when they show up in <tt>ps</tt> listings. - -<p> -RCU must therefore wait for a given CPU to actually come online before -it can allow itself to believe that the CPU actually exists. -The resulting “ghost CPUs” (which are never going to -come online) cause a number of -<a href="https://paulmck.livejournal.com/37494.html">interesting complications</a>. - -<h3><a name="Early Boot">Early Boot</a></h3> - -<p> -The Linux kernel's boot sequence is an interesting process, -and RCU is used early, even before <tt>rcu_init()</tt> -is invoked. -In fact, a number of RCU's primitives can be used as soon as the -initial task's <tt>task_struct</tt> is available and the -boot CPU's per-CPU variables are set up. -The read-side primitives (<tt>rcu_read_lock()</tt>, -<tt>rcu_read_unlock()</tt>, <tt>rcu_dereference()</tt>, -and <tt>rcu_access_pointer()</tt>) will operate normally very early on, -as will <tt>rcu_assign_pointer()</tt>. - -<p> -Although <tt>call_rcu()</tt> may be invoked at any -time during boot, callbacks are not guaranteed to be invoked until after -all of RCU's kthreads have been spawned, which occurs at -<tt>early_initcall()</tt> time. -This delay in callback invocation is due to the fact that RCU does not -invoke callbacks until it is fully initialized, and this full initialization -cannot occur until after the scheduler has initialized itself to the -point where RCU can spawn and run its kthreads. -In theory, it would be possible to invoke callbacks earlier, -however, this is not a panacea because there would be severe restrictions -on what operations those callbacks could invoke. - -<p> -Perhaps surprisingly, <tt>synchronize_rcu()</tt> and -<tt>synchronize_rcu_expedited()</tt>, -will operate normally -during very early boot, the reason being that there is only one CPU -and preemption is disabled. -This means that the call <tt>synchronize_rcu()</tt> (or friends) -itself is a quiescent -state and thus a grace period, so the early-boot implementation can -be a no-op. - -<p> -However, once the scheduler has spawned its first kthread, this early -boot trick fails for <tt>synchronize_rcu()</tt> (as well as for -<tt>synchronize_rcu_expedited()</tt>) in <tt>CONFIG_PREEMPT=y</tt> -kernels. -The reason is that an RCU read-side critical section might be preempted, -which means that a subsequent <tt>synchronize_rcu()</tt> really does have -to wait for something, as opposed to simply returning immediately. -Unfortunately, <tt>synchronize_rcu()</tt> can't do this until all of -its kthreads are spawned, which doesn't happen until some time during -<tt>early_initcalls()</tt> time. -But this is no excuse: RCU is nevertheless required to correctly handle -synchronous grace periods during this time period. -Once all of its kthreads are up and running, RCU starts running -normally. - -<table> -<tr><th> </th></tr> -<tr><th align="left">Quick Quiz:</th></tr> -<tr><td> - How can RCU possibly handle grace periods before all of its - kthreads have been spawned??? -</td></tr> -<tr><th align="left">Answer:</th></tr> -<tr><td bgcolor="#ffffff"><font color="ffffff"> - Very carefully! - </font> - - <p><font color="ffffff"> - During the “dead zone” between the time that the - scheduler spawns the first task and the time that all of RCU's - kthreads have been spawned, all synchronous grace periods are - handled by the expedited grace-period mechanism. - At runtime, this expedited mechanism relies on workqueues, but - during the dead zone the requesting task itself drives the - desired expedited grace period. - Because dead-zone execution takes place within task context, - everything works. - Once the dead zone ends, expedited grace periods go back to - using workqueues, as is required to avoid problems that would - otherwise occur when a user task received a POSIX signal while - driving an expedited grace period. - </font> - - <p><font color="ffffff"> - And yes, this does mean that it is unhelpful to send POSIX - signals to random tasks between the time that the scheduler - spawns its first kthread and the time that RCU's kthreads - have all been spawned. - If there ever turns out to be a good reason for sending POSIX - signals during that time, appropriate adjustments will be made. - (If it turns out that POSIX signals are sent during this time for - no good reason, other adjustments will be made, appropriate - or otherwise.) -</font></td></tr> -<tr><td> </td></tr> -</table> - -<p> -I learned of these boot-time requirements as a result of a series of -system hangs. - -<h3><a name="Interrupts and NMIs">Interrupts and NMIs</a></h3> - -<p> -The Linux kernel has interrupts, and RCU read-side critical sections are -legal within interrupt handlers and within interrupt-disabled regions -of code, as are invocations of <tt>call_rcu()</tt>. - -<p> -Some Linux-kernel architectures can enter an interrupt handler from -non-idle process context, and then just never leave it, instead stealthily -transitioning back to process context. -This trick is sometimes used to invoke system calls from inside the kernel. -These “half-interrupts” mean that RCU has to be very careful -about how it counts interrupt nesting levels. -I learned of this requirement the hard way during a rewrite -of RCU's dyntick-idle code. - -<p> -The Linux kernel has non-maskable interrupts (NMIs), and -RCU read-side critical sections are legal within NMI handlers. -Thankfully, RCU update-side primitives, including -<tt>call_rcu()</tt>, are prohibited within NMI handlers. - -<p> -The name notwithstanding, some Linux-kernel architectures -can have nested NMIs, which RCU must handle correctly. -Andy Lutomirski -<a href="https://lkml.kernel.org/r/CALCETrXLq1y7e_dKFPgou-FKHB6Pu-r8+t-6Ds+8=va7anBWDA@mail.gmail.com">surprised me</a> -with this requirement; -he also kindly surprised me with -<a href="https://lkml.kernel.org/r/CALCETrXSY9JpW3uE6H8WYk81sg56qasA2aqmjMPsq5dOtzso=g@mail.gmail.com">an algorithm</a> -that meets this requirement. - -<p> -Furthermore, NMI handlers can be interrupted by what appear to RCU -to be normal interrupts. -One way that this can happen is for code that directly invokes -<tt>rcu_irq_enter()</tt> and <tt>rcu_irq_exit()</tt> to be called -from an NMI handler. -This astonishing fact of life prompted the current code structure, -which has <tt>rcu_irq_enter()</tt> invoking <tt>rcu_nmi_enter()</tt> -and <tt>rcu_irq_exit()</tt> invoking <tt>rcu_nmi_exit()</tt>. -And yes, I also learned of this requirement the hard way. - -<h3><a name="Loadable Modules">Loadable Modules</a></h3> - -<p> -The Linux kernel has loadable modules, and these modules can -also be unloaded. -After a given module has been unloaded, any attempt to call -one of its functions results in a segmentation fault. -The module-unload functions must therefore cancel any -delayed calls to loadable-module functions, for example, -any outstanding <tt>mod_timer()</tt> must be dealt with -via <tt>del_timer_sync()</tt> or similar. - -<p> -Unfortunately, there is no way to cancel an RCU callback; -once you invoke <tt>call_rcu()</tt>, the callback function is -eventually going to be invoked, unless the system goes down first. -Because it is normally considered socially irresponsible to crash the system -in response to a module unload request, we need some other way -to deal with in-flight RCU callbacks. - -<p> -RCU therefore provides -<tt><a href="https://lwn.net/Articles/217484/">rcu_barrier()</a></tt>, -which waits until all in-flight RCU callbacks have been invoked. -If a module uses <tt>call_rcu()</tt>, its exit function should therefore -prevent any future invocation of <tt>call_rcu()</tt>, then invoke -<tt>rcu_barrier()</tt>. -In theory, the underlying module-unload code could invoke -<tt>rcu_barrier()</tt> unconditionally, but in practice this would -incur unacceptable latencies. - -<p> -Nikita Danilov noted this requirement for an analogous filesystem-unmount -situation, and Dipankar Sarma incorporated <tt>rcu_barrier()</tt> into RCU. -The need for <tt>rcu_barrier()</tt> for module unloading became -apparent later. - -<p> -<b>Important note</b>: The <tt>rcu_barrier()</tt> function is not, -repeat, <i>not</i>, obligated to wait for a grace period. -It is instead only required to wait for RCU callbacks that have -already been posted. -Therefore, if there are no RCU callbacks posted anywhere in the system, -<tt>rcu_barrier()</tt> is within its rights to return immediately. -Even if there are callbacks posted, <tt>rcu_barrier()</tt> does not -necessarily need to wait for a grace period. - -<table> -<tr><th> </th></tr> -<tr><th align="left">Quick Quiz:</th></tr> -<tr><td> - Wait a minute! - Each RCU callbacks must wait for a grace period to complete, - and <tt>rcu_barrier()</tt> must wait for each pre-existing - callback to be invoked. - Doesn't <tt>rcu_barrier()</tt> therefore need to wait for - a full grace period if there is even one callback posted anywhere - in the system? -</td></tr> -<tr><th align="left">Answer:</th></tr> -<tr><td bgcolor="#ffffff"><font color="ffffff"> - Absolutely not!!! - </font> - - <p><font color="ffffff"> - Yes, each RCU callbacks must wait for a grace period to complete, - but it might well be partly (or even completely) finished waiting - by the time <tt>rcu_barrier()</tt> is invoked. - In that case, <tt>rcu_barrier()</tt> need only wait for the - remaining portion of the grace period to elapse. - So even if there are quite a few callbacks posted, - <tt>rcu_barrier()</tt> might well return quite quickly. - </font> - - <p><font color="ffffff"> - So if you need to wait for a grace period as well as for all - pre-existing callbacks, you will need to invoke both - <tt>synchronize_rcu()</tt> and <tt>rcu_barrier()</tt>. - If latency is a concern, you can always use workqueues - to invoke them concurrently. -</font></td></tr> -<tr><td> </td></tr> -</table> - -<h3><a name="Hotplug CPU">Hotplug CPU</a></h3> - -<p> -The Linux kernel supports CPU hotplug, which means that CPUs -can come and go. -It is of course illegal to use any RCU API member from an offline CPU, -with the exception of <a href="#Sleepable RCU">SRCU</a> read-side -critical sections. -This requirement was present from day one in DYNIX/ptx, but -on the other hand, the Linux kernel's CPU-hotplug implementation -is “interesting.” - -<p> -The Linux-kernel CPU-hotplug implementation has notifiers that -are used to allow the various kernel subsystems (including RCU) -to respond appropriately to a given CPU-hotplug operation. -Most RCU operations may be invoked from CPU-hotplug notifiers, -including even synchronous grace-period operations such as -<tt>synchronize_rcu()</tt> and <tt>synchronize_rcu_expedited()</tt>. - -<p> -However, all-callback-wait operations such as -<tt>rcu_barrier()</tt> are also not supported, due to the -fact that there are phases of CPU-hotplug operations where -the outgoing CPU's callbacks will not be invoked until after -the CPU-hotplug operation ends, which could also result in deadlock. -Furthermore, <tt>rcu_barrier()</tt> blocks CPU-hotplug operations -during its execution, which results in another type of deadlock -when invoked from a CPU-hotplug notifier. - -<h3><a name="Scheduler and RCU">Scheduler and RCU</a></h3> - -<p> -RCU depends on the scheduler, and the scheduler uses RCU to -protect some of its data structures. -The preemptible-RCU <tt>rcu_read_unlock()</tt> -implementation must therefore be written carefully to avoid deadlocks -involving the scheduler's runqueue and priority-inheritance locks. -In particular, <tt>rcu_read_unlock()</tt> must tolerate an -interrupt where the interrupt handler invokes both -<tt>rcu_read_lock()</tt> and <tt>rcu_read_unlock()</tt>. -This possibility requires <tt>rcu_read_unlock()</tt> to use -negative nesting levels to avoid destructive recursion via -interrupt handler's use of RCU. - -<p> -This scheduler-RCU requirement came as a -<a href="https://lwn.net/Articles/453002/">complete surprise</a>. - -<p> -As noted above, RCU makes use of kthreads, and it is necessary to -avoid excessive CPU-time accumulation by these kthreads. -This requirement was no surprise, but RCU's violation of it -when running context-switch-heavy workloads when built with -<tt>CONFIG_NO_HZ_FULL=y</tt> -<a href="http://www.rdrop.com/users/paulmck/scalability/paper/BareMetal.2015.01.15b.pdf">did come as a surprise [PDF]</a>. -RCU has made good progress towards meeting this requirement, even -for context-switch-heavy <tt>CONFIG_NO_HZ_FULL=y</tt> workloads, -but there is room for further improvement. - -<p> -It is forbidden to hold any of scheduler's runqueue or priority-inheritance -spinlocks across an <tt>rcu_read_unlock()</tt> unless interrupts have been -disabled across the entire RCU read-side critical section, that is, -up to and including the matching <tt>rcu_read_lock()</tt>. -Violating this restriction can result in deadlocks involving these -scheduler spinlocks. -There was hope that this restriction might be lifted when interrupt-disabled -calls to <tt>rcu_read_unlock()</tt> started deferring the reporting of -the resulting RCU-preempt quiescent state until the end of the corresponding -interrupts-disabled region. -Unfortunately, timely reporting of the corresponding quiescent state -to expedited grace periods requires a call to <tt>raise_softirq()</tt>, -which can acquire these scheduler spinlocks. -In addition, real-time systems using RCU priority boosting -need this restriction to remain in effect because deferred -quiescent-state reporting would also defer deboosting, which in turn -would degrade real-time latencies. - -<p> -In theory, if a given RCU read-side critical section could be -guaranteed to be less than one second in duration, holding a scheduler -spinlock across that critical section's <tt>rcu_read_unlock()</tt> -would require only that preemption be disabled across the entire -RCU read-side critical section, not interrupts. -Unfortunately, given the possibility of vCPU preemption, long-running -interrupts, and so on, it is not possible in practice to guarantee -that a given RCU read-side critical section will complete in less than -one second. -Therefore, as noted above, if scheduler spinlocks are held across -a given call to <tt>rcu_read_unlock()</tt>, interrupts must be -disabled across the entire RCU read-side critical section. - -<h3><a name="Tracing and RCU">Tracing and RCU</a></h3> - -<p> -It is possible to use tracing on RCU code, but tracing itself -uses RCU. -For this reason, <tt>rcu_dereference_raw_notrace()</tt> -is provided for use by tracing, which avoids the destructive -recursion that could otherwise ensue. -This API is also used by virtualization in some architectures, -where RCU readers execute in environments in which tracing -cannot be used. -The tracing folks both located the requirement and provided the -needed fix, so this surprise requirement was relatively painless. - -<h3><a name="Energy Efficiency">Energy Efficiency</a></h3> - -<p> -Interrupting idle CPUs is considered socially unacceptable, -especially by people with battery-powered embedded systems. -RCU therefore conserves energy by detecting which CPUs are -idle, including tracking CPUs that have been interrupted from idle. -This is a large part of the energy-efficiency requirement, -so I learned of this via an irate phone call. - -<p> -Because RCU avoids interrupting idle CPUs, it is illegal to -execute an RCU read-side critical section on an idle CPU. -(Kernels built with <tt>CONFIG_PROVE_RCU=y</tt> will splat -if you try it.) -The <tt>RCU_NONIDLE()</tt> macro and <tt>_rcuidle</tt> -event tracing is provided to work around this restriction. -In addition, <tt>rcu_is_watching()</tt> may be used to -test whether or not it is currently legal to run RCU read-side -critical sections on this CPU. -I learned of the need for diagnostics on the one hand -and <tt>RCU_NONIDLE()</tt> on the other while inspecting -idle-loop code. -Steven Rostedt supplied <tt>_rcuidle</tt> event tracing, -which is used quite heavily in the idle loop. -However, there are some restrictions on the code placed within -<tt>RCU_NONIDLE()</tt>: - -<ol> -<li> Blocking is prohibited. - In practice, this is not a serious restriction given that idle - tasks are prohibited from blocking to begin with. -<li> Although nesting <tt>RCU_NONIDLE()</tt> is permitted, they cannot - nest indefinitely deeply. - However, given that they can be nested on the order of a million - deep, even on 32-bit systems, this should not be a serious - restriction. - This nesting limit would probably be reached long after the - compiler OOMed or the stack overflowed. -<li> Any code path that enters <tt>RCU_NONIDLE()</tt> must sequence - out of that same <tt>RCU_NONIDLE()</tt>. - For example, the following is grossly illegal: - - <blockquote> - <pre> - 1 RCU_NONIDLE({ - 2 do_something(); - 3 goto bad_idea; /* BUG!!! */ - 4 do_something_else();}); - 5 bad_idea: - </pre> - </blockquote> - - <p> - It is just as illegal to transfer control into the middle of - <tt>RCU_NONIDLE()</tt>'s argument. - Yes, in theory, you could transfer in as long as you also - transferred out, but in practice you could also expect to get sharply - worded review comments. -</ol> - -<p> -It is similarly socially unacceptable to interrupt an -<tt>nohz_full</tt> CPU running in userspace. -RCU must therefore track <tt>nohz_full</tt> userspace -execution. -RCU must therefore be able to sample state at two points in -time, and be able to determine whether or not some other CPU spent -any time idle and/or executing in userspace. - -<p> -These energy-efficiency requirements have proven quite difficult to -understand and to meet, for example, there have been more than five -clean-sheet rewrites of RCU's energy-efficiency code, the last of -which was finally able to demonstrate -<a href="http://www.rdrop.com/users/paulmck/realtime/paper/AMPenergy.2013.04.19a.pdf">real energy savings running on real hardware [PDF]</a>. -As noted earlier, -I learned of many of these requirements via angry phone calls: -Flaming me on the Linux-kernel mailing list was apparently not -sufficient to fully vent their ire at RCU's energy-efficiency bugs! - -<h3><a name="Scheduling-Clock Interrupts and RCU"> -Scheduling-Clock Interrupts and RCU</a></h3> - -<p> -The kernel transitions between in-kernel non-idle execution, userspace -execution, and the idle loop. -Depending on kernel configuration, RCU handles these states differently: - -<table border=3> -<tr><th><tt>HZ</tt> Kconfig</th> - <th>In-Kernel</th> - <th>Usermode</th> - <th>Idle</th></tr> -<tr><th align="left"><tt>HZ_PERIODIC</tt></th> - <td>Can rely on scheduling-clock interrupt.</td> - <td>Can rely on scheduling-clock interrupt and its - detection of interrupt from usermode.</td> - <td>Can rely on RCU's dyntick-idle detection.</td></tr> -<tr><th align="left"><tt>NO_HZ_IDLE</tt></th> - <td>Can rely on scheduling-clock interrupt.</td> - <td>Can rely on scheduling-clock interrupt and its - detection of interrupt from usermode.</td> - <td>Can rely on RCU's dyntick-idle detection.</td></tr> -<tr><th align="left"><tt>NO_HZ_FULL</tt></th> - <td>Can only sometimes rely on scheduling-clock interrupt. - In other cases, it is necessary to bound kernel execution - times and/or use IPIs.</td> - <td>Can rely on RCU's dyntick-idle detection.</td> - <td>Can rely on RCU's dyntick-idle detection.</td></tr> -</table> - -<table> -<tr><th> </th></tr> -<tr><th align="left">Quick Quiz:</th></tr> -<tr><td> - Why can't <tt>NO_HZ_FULL</tt> in-kernel execution rely on the - scheduling-clock interrupt, just like <tt>HZ_PERIODIC</tt> - and <tt>NO_HZ_IDLE</tt> do? -</td></tr> -<tr><th align="left">Answer:</th></tr> -<tr><td bgcolor="#ffffff"><font color="ffffff"> - Because, as a performance optimization, <tt>NO_HZ_FULL</tt> - does not necessarily re-enable the scheduling-clock interrupt - on entry to each and every system call. -</font></td></tr> -<tr><td> </td></tr> -</table> - -<p> -However, RCU must be reliably informed as to whether any given -CPU is currently in the idle loop, and, for <tt>NO_HZ_FULL</tt>, -also whether that CPU is executing in usermode, as discussed -<a href="#Energy Efficiency">earlier</a>. -It also requires that the scheduling-clock interrupt be enabled when -RCU needs it to be: - -<ol> -<li> If a CPU is either idle or executing in usermode, and RCU believes - it is non-idle, the scheduling-clock tick had better be running. - Otherwise, you will get RCU CPU stall warnings. Or at best, - very long (11-second) grace periods, with a pointless IPI waking - the CPU from time to time. -<li> If a CPU is in a portion of the kernel that executes RCU read-side - critical sections, and RCU believes this CPU to be idle, you will get - random memory corruption. <b>DON'T DO THIS!!!</b> - - <br>This is one reason to test with lockdep, which will complain - about this sort of thing. -<li> If a CPU is in a portion of the kernel that is absolutely - positively no-joking guaranteed to never execute any RCU read-side - critical sections, and RCU believes this CPU to to be idle, - no problem. This sort of thing is used by some architectures - for light-weight exception handlers, which can then avoid the - overhead of <tt>rcu_irq_enter()</tt> and <tt>rcu_irq_exit()</tt> - at exception entry and exit, respectively. - Some go further and avoid the entireties of <tt>irq_enter()</tt> - and <tt>irq_exit()</tt>. - - <br>Just make very sure you are running some of your tests with - <tt>CONFIG_PROVE_RCU=y</tt>, just in case one of your code paths - was in fact joking about not doing RCU read-side critical sections. -<li> If a CPU is executing in the kernel with the scheduling-clock - interrupt disabled and RCU believes this CPU to be non-idle, - and if the CPU goes idle (from an RCU perspective) every few - jiffies, no problem. It is usually OK for there to be the - occasional gap between idle periods of up to a second or so. - - <br>If the gap grows too long, you get RCU CPU stall warnings. -<li> If a CPU is either idle or executing in usermode, and RCU believes - it to be idle, of course no problem. -<li> If a CPU is executing in the kernel, the kernel code - path is passing through quiescent states at a reasonable - frequency (preferably about once per few jiffies, but the - occasional excursion to a second or so is usually OK) and the - scheduling-clock interrupt is enabled, of course no problem. - - <br>If the gap between a successive pair of quiescent states grows - too long, you get RCU CPU stall warnings. -</ol> - -<table> -<tr><th> </th></tr> -<tr><th align="left">Quick Quiz:</th></tr> -<tr><td> - But what if my driver has a hardware interrupt handler - that can run for many seconds? - I cannot invoke <tt>schedule()</tt> from an hardware - interrupt handler, after all! -</td></tr> -<tr><th align="left">Answer:</th></tr> -<tr><td bgcolor="#ffffff"><font color="ffffff"> - One approach is to do <tt>rcu_irq_exit();rcu_irq_enter();</tt> - every so often. - But given that long-running interrupt handlers can cause - other problems, not least for response time, shouldn't you - work to keep your interrupt handler's runtime within reasonable - bounds? -</font></td></tr> -<tr><td> </td></tr> -</table> - -<p> -But as long as RCU is properly informed of kernel state transitions between -in-kernel execution, usermode execution, and idle, and as long as the -scheduling-clock interrupt is enabled when RCU needs it to be, you -can rest assured that the bugs you encounter will be in some other -part of RCU or some other part of the kernel! - -<h3><a name="Memory Efficiency">Memory Efficiency</a></h3> - -<p> -Although small-memory non-realtime systems can simply use Tiny RCU, -code size is only one aspect of memory efficiency. -Another aspect is the size of the <tt>rcu_head</tt> structure -used by <tt>call_rcu()</tt> and <tt>kfree_rcu()</tt>. -Although this structure contains nothing more than a pair of pointers, -it does appear in many RCU-protected data structures, including -some that are size critical. -The <tt>page</tt> structure is a case in point, as evidenced by -the many occurrences of the <tt>union</tt> keyword within that structure. - -<p> -This need for memory efficiency is one reason that RCU uses hand-crafted -singly linked lists to track the <tt>rcu_head</tt> structures that -are waiting for a grace period to elapse. -It is also the reason why <tt>rcu_head</tt> structures do not contain -debug information, such as fields tracking the file and line of the -<tt>call_rcu()</tt> or <tt>kfree_rcu()</tt> that posted them. -Although this information might appear in debug-only kernel builds at some -point, in the meantime, the <tt>->func</tt> field will often provide -the needed debug information. - -<p> -However, in some cases, the need for memory efficiency leads to even -more extreme measures. -Returning to the <tt>page</tt> structure, the <tt>rcu_head</tt> field -shares storage with a great many other structures that are used at -various points in the corresponding page's lifetime. -In order to correctly resolve certain -<a href="https://lkml.kernel.org/g/1439976106-137226-1-git-send-email-kirill.shutemov@linux.intel.com">race conditions</a>, -the Linux kernel's memory-management subsystem needs a particular bit -to remain zero during all phases of grace-period processing, -and that bit happens to map to the bottom bit of the -<tt>rcu_head</tt> structure's <tt>->next</tt> field. -RCU makes this guarantee as long as <tt>call_rcu()</tt> -is used to post the callback, as opposed to <tt>kfree_rcu()</tt> -or some future “lazy” -variant of <tt>call_rcu()</tt> that might one day be created for -energy-efficiency purposes. - -<p> -That said, there are limits. -RCU requires that the <tt>rcu_head</tt> structure be aligned to a -two-byte boundary, and passing a misaligned <tt>rcu_head</tt> -structure to one of the <tt>call_rcu()</tt> family of functions -will result in a splat. -It is therefore necessary to exercise caution when packing -structures containing fields of type <tt>rcu_head</tt>. -Why not a four-byte or even eight-byte alignment requirement? -Because the m68k architecture provides only two-byte alignment, -and thus acts as alignment's least common denominator. - -<p> -The reason for reserving the bottom bit of pointers to -<tt>rcu_head</tt> structures is to leave the door open to -“lazy” callbacks whose invocations can safely be deferred. -Deferring invocation could potentially have energy-efficiency -benefits, but only if the rate of non-lazy callbacks decreases -significantly for some important workload. -In the meantime, reserving the bottom bit keeps this option open -in case it one day becomes useful. - -<h3><a name="Performance, Scalability, Response Time, and Reliability"> -Performance, Scalability, Response Time, and Reliability</a></h3> - -<p> -Expanding on the -<a href="#Performance and Scalability">earlier discussion</a>, -RCU is used heavily by hot code paths in performance-critical -portions of the Linux kernel's networking, security, virtualization, -and scheduling code paths. -RCU must therefore use efficient implementations, especially in its -read-side primitives. -To that end, it would be good if preemptible RCU's implementation -of <tt>rcu_read_lock()</tt> could be inlined, however, doing -this requires resolving <tt>#include</tt> issues with the -<tt>task_struct</tt> structure. - -<p> -The Linux kernel supports hardware configurations with up to -4096 CPUs, which means that RCU must be extremely scalable. -Algorithms that involve frequent acquisitions of global locks or -frequent atomic operations on global variables simply cannot be -tolerated within the RCU implementation. -RCU therefore makes heavy use of a combining tree based on the -<tt>rcu_node</tt> structure. -RCU is required to tolerate all CPUs continuously invoking any -combination of RCU's runtime primitives with minimal per-operation -overhead. -In fact, in many cases, increasing load must <i>decrease</i> the -per-operation overhead, witness the batching optimizations for -<tt>synchronize_rcu()</tt>, <tt>call_rcu()</tt>, -<tt>synchronize_rcu_expedited()</tt>, and <tt>rcu_barrier()</tt>. -As a general rule, RCU must cheerfully accept whatever the -rest of the Linux kernel decides to throw at it. - -<p> -The Linux kernel is used for real-time workloads, especially -in conjunction with the -<a href="https://rt.wiki.kernel.org/index.php/Main_Page">-rt patchset</a>. -The real-time-latency response requirements are such that the -traditional approach of disabling preemption across RCU -read-side critical sections is inappropriate. -Kernels built with <tt>CONFIG_PREEMPT=y</tt> therefore -use an RCU implementation that allows RCU read-side critical -sections to be preempted. -This requirement made its presence known after users made it -clear that an earlier -<a href="https://lwn.net/Articles/107930/">real-time patch</a> -did not meet their needs, in conjunction with some -<a href="https://lkml.kernel.org/g/20050318002026.GA2693@us.ibm.com">RCU issues</a> -encountered by a very early version of the -rt patchset. - -<p> -In addition, RCU must make do with a sub-100-microsecond real-time latency -budget. -In fact, on smaller systems with the -rt patchset, the Linux kernel -provides sub-20-microsecond real-time latencies for the whole kernel, -including RCU. -RCU's scalability and latency must therefore be sufficient for -these sorts of configurations. -To my surprise, the sub-100-microsecond real-time latency budget -<a href="http://www.rdrop.com/users/paulmck/realtime/paper/bigrt.2013.01.31a.LCA.pdf"> -applies to even the largest systems [PDF]</a>, -up to and including systems with 4096 CPUs. -This real-time requirement motivated the grace-period kthread, which -also simplified handling of a number of race conditions. - -<p> -RCU must avoid degrading real-time response for CPU-bound threads, whether -executing in usermode (which is one use case for -<tt>CONFIG_NO_HZ_FULL=y</tt>) or in the kernel. -That said, CPU-bound loops in the kernel must execute -<tt>cond_resched()</tt> at least once per few tens of milliseconds -in order to avoid receiving an IPI from RCU. - -<p> -Finally, RCU's status as a synchronization primitive means that -any RCU failure can result in arbitrary memory corruption that can be -extremely difficult to debug. -This means that RCU must be extremely reliable, which in -practice also means that RCU must have an aggressive stress-test -suite. -This stress-test suite is called <tt>rcutorture</tt>. - -<p> -Although the need for <tt>rcutorture</tt> was no surprise, -the current immense popularity of the Linux kernel is posing -interesting—and perhaps unprecedented—validation -challenges. -To see this, keep in mind that there are well over one billion -instances of the Linux kernel running today, given Android -smartphones, Linux-powered televisions, and servers. -This number can be expected to increase sharply with the advent of -the celebrated Internet of Things. - -<p> -Suppose that RCU contains a race condition that manifests on average -once per million years of runtime. -This bug will be occurring about three times per <i>day</i> across -the installed base. -RCU could simply hide behind hardware error rates, given that no one -should really expect their smartphone to last for a million years. -However, anyone taking too much comfort from this thought should -consider the fact that in most jurisdictions, a successful multi-year -test of a given mechanism, which might include a Linux kernel, -suffices for a number of types of safety-critical certifications. -In fact, rumor has it that the Linux kernel is already being used -in production for safety-critical applications. -I don't know about you, but I would feel quite bad if a bug in RCU -killed someone. -Which might explain my recent focus on validation and verification. - -<h2><a name="Other RCU Flavors">Other RCU Flavors</a></h2> - -<p> -One of the more surprising things about RCU is that there are now -no fewer than five <i>flavors</i>, or API families. -In addition, the primary flavor that has been the sole focus up to -this point has two different implementations, non-preemptible and -preemptible. -The other four flavors are listed below, with requirements for each -described in a separate section. - -<ol> -<li> <a href="#Bottom-Half Flavor">Bottom-Half Flavor (Historical)</a> -<li> <a href="#Sched Flavor">Sched Flavor (Historical)</a> -<li> <a href="#Sleepable RCU">Sleepable RCU</a> -<li> <a href="#Tasks RCU">Tasks RCU</a> -</ol> - -<h3><a name="Bottom-Half Flavor">Bottom-Half Flavor (Historical)</a></h3> - -<p> -The RCU-bh flavor of RCU has since been expressed in terms of -the other RCU flavors as part of a consolidation of the three -flavors into a single flavor. -The read-side API remains, and continues to disable softirq and to -be accounted for by lockdep. -Much of the material in this section is therefore strictly historical -in nature. - -<p> -The softirq-disable (AKA “bottom-half”, -hence the “_bh” abbreviations) -flavor of RCU, or <i>RCU-bh</i>, was developed by -Dipankar Sarma to provide a flavor of RCU that could withstand the -network-based denial-of-service attacks researched by Robert -Olsson. -These attacks placed so much networking load on the system -that some of the CPUs never exited softirq execution, -which in turn prevented those CPUs from ever executing a context switch, -which, in the RCU implementation of that time, prevented grace periods -from ever ending. -The result was an out-of-memory condition and a system hang. - -<p> -The solution was the creation of RCU-bh, which does -<tt>local_bh_disable()</tt> -across its read-side critical sections, and which uses the transition -from one type of softirq processing to another as a quiescent state -in addition to context switch, idle, user mode, and offline. -This means that RCU-bh grace periods can complete even when some of -the CPUs execute in softirq indefinitely, thus allowing algorithms -based on RCU-bh to withstand network-based denial-of-service attacks. - -<p> -Because -<tt>rcu_read_lock_bh()</tt> and <tt>rcu_read_unlock_bh()</tt> -disable and re-enable softirq handlers, any attempt to start a softirq -handlers during the -RCU-bh read-side critical section will be deferred. -In this case, <tt>rcu_read_unlock_bh()</tt> -will invoke softirq processing, which can take considerable time. -One can of course argue that this softirq overhead should be associated -with the code following the RCU-bh read-side critical section rather -than <tt>rcu_read_unlock_bh()</tt>, but the fact -is that most profiling tools cannot be expected to make this sort -of fine distinction. -For example, suppose that a three-millisecond-long RCU-bh read-side -critical section executes during a time of heavy networking load. -There will very likely be an attempt to invoke at least one softirq -handler during that three milliseconds, but any such invocation will -be delayed until the time of the <tt>rcu_read_unlock_bh()</tt>. -This can of course make it appear at first glance as if -<tt>rcu_read_unlock_bh()</tt> was executing very slowly. - -<p> -The -<a href="https://lwn.net/Articles/609973/#RCU Per-Flavor API Table">RCU-bh API</a> -includes -<tt>rcu_read_lock_bh()</tt>, -<tt>rcu_read_unlock_bh()</tt>, -<tt>rcu_dereference_bh()</tt>, -<tt>rcu_dereference_bh_check()</tt>, -<tt>synchronize_rcu_bh()</tt>, -<tt>synchronize_rcu_bh_expedited()</tt>, -<tt>call_rcu_bh()</tt>, -<tt>rcu_barrier_bh()</tt>, and -<tt>rcu_read_lock_bh_held()</tt>. -However, the update-side APIs are now simple wrappers for other RCU -flavors, namely RCU-sched in CONFIG_PREEMPT=n kernels and RCU-preempt -otherwise. - -<h3><a name="Sched Flavor">Sched Flavor (Historical)</a></h3> - -<p> -The RCU-sched flavor of RCU has since been expressed in terms of -the other RCU flavors as part of a consolidation of the three -flavors into a single flavor. -The read-side API remains, and continues to disable preemption and to -be accounted for by lockdep. -Much of the material in this section is therefore strictly historical -in nature. - -<p> -Before preemptible RCU, waiting for an RCU grace period had the -side effect of also waiting for all pre-existing interrupt -and NMI handlers. -However, there are legitimate preemptible-RCU implementations that -do not have this property, given that any point in the code outside -of an RCU read-side critical section can be a quiescent state. -Therefore, <i>RCU-sched</i> was created, which follows “classic” -RCU in that an RCU-sched grace period waits for for pre-existing -interrupt and NMI handlers. -In kernels built with <tt>CONFIG_PREEMPT=n</tt>, the RCU and RCU-sched -APIs have identical implementations, while kernels built with -<tt>CONFIG_PREEMPT=y</tt> provide a separate implementation for each. - -<p> -Note well that in <tt>CONFIG_PREEMPT=y</tt> kernels, -<tt>rcu_read_lock_sched()</tt> and <tt>rcu_read_unlock_sched()</tt> -disable and re-enable preemption, respectively. -This means that if there was a preemption attempt during the -RCU-sched read-side critical section, <tt>rcu_read_unlock_sched()</tt> -will enter the scheduler, with all the latency and overhead entailed. -Just as with <tt>rcu_read_unlock_bh()</tt>, this can make it look -as if <tt>rcu_read_unlock_sched()</tt> was executing very slowly. -However, the highest-priority task won't be preempted, so that task -will enjoy low-overhead <tt>rcu_read_unlock_sched()</tt> invocations. - -<p> -The -<a href="https://lwn.net/Articles/609973/#RCU Per-Flavor API Table">RCU-sched API</a> -includes -<tt>rcu_read_lock_sched()</tt>, -<tt>rcu_read_unlock_sched()</tt>, -<tt>rcu_read_lock_sched_notrace()</tt>, -<tt>rcu_read_unlock_sched_notrace()</tt>, -<tt>rcu_dereference_sched()</tt>, -<tt>rcu_dereference_sched_check()</tt>, -<tt>synchronize_sched()</tt>, -<tt>synchronize_rcu_sched_expedited()</tt>, -<tt>call_rcu_sched()</tt>, -<tt>rcu_barrier_sched()</tt>, and -<tt>rcu_read_lock_sched_held()</tt>. -However, anything that disables preemption also marks an RCU-sched -read-side critical section, including -<tt>preempt_disable()</tt> and <tt>preempt_enable()</tt>, -<tt>local_irq_save()</tt> and <tt>local_irq_restore()</tt>, -and so on. - -<h3><a name="Sleepable RCU">Sleepable RCU</a></h3> - -<p> -For well over a decade, someone saying “I need to block within -an RCU read-side critical section” was a reliable indication -that this someone did not understand RCU. -After all, if you are always blocking in an RCU read-side critical -section, you can probably afford to use a higher-overhead synchronization -mechanism. -However, that changed with the advent of the Linux kernel's notifiers, -whose RCU read-side critical -sections almost never sleep, but sometimes need to. -This resulted in the introduction of -<a href="https://lwn.net/Articles/202847/">sleepable RCU</a>, -or <i>SRCU</i>. - -<p> -SRCU allows different domains to be defined, with each such domain -defined by an instance of an <tt>srcu_struct</tt> structure. -A pointer to this structure must be passed in to each SRCU function, -for example, <tt>synchronize_srcu(&ss)</tt>, where -<tt>ss</tt> is the <tt>srcu_struct</tt> structure. -The key benefit of these domains is that a slow SRCU reader in one -domain does not delay an SRCU grace period in some other domain. -That said, one consequence of these domains is that read-side code -must pass a “cookie” from <tt>srcu_read_lock()</tt> -to <tt>srcu_read_unlock()</tt>, for example, as follows: - -<blockquote> -<pre> - 1 int idx; - 2 - 3 idx = srcu_read_lock(&ss); - 4 do_something(); - 5 srcu_read_unlock(&ss, idx); -</pre> -</blockquote> - -<p> -As noted above, it is legal to block within SRCU read-side critical sections, -however, with great power comes great responsibility. -If you block forever in one of a given domain's SRCU read-side critical -sections, then that domain's grace periods will also be blocked forever. -Of course, one good way to block forever is to deadlock, which can -happen if any operation in a given domain's SRCU read-side critical -section can wait, either directly or indirectly, for that domain's -grace period to elapse. -For example, this results in a self-deadlock: - -<blockquote> -<pre> - 1 int idx; - 2 - 3 idx = srcu_read_lock(&ss); - 4 do_something(); - 5 synchronize_srcu(&ss); - 6 srcu_read_unlock(&ss, idx); -</pre> -</blockquote> - -<p> -However, if line 5 acquired a mutex that was held across -a <tt>synchronize_srcu()</tt> for domain <tt>ss</tt>, -deadlock would still be possible. -Furthermore, if line 5 acquired a mutex that was held across -a <tt>synchronize_srcu()</tt> for some other domain <tt>ss1</tt>, -and if an <tt>ss1</tt>-domain SRCU read-side critical section -acquired another mutex that was held across as <tt>ss</tt>-domain -<tt>synchronize_srcu()</tt>, -deadlock would again be possible. -Such a deadlock cycle could extend across an arbitrarily large number -of different SRCU domains. -Again, with great power comes great responsibility. - -<p> -Unlike the other RCU flavors, SRCU read-side critical sections can -run on idle and even offline CPUs. -This ability requires that <tt>srcu_read_lock()</tt> and -<tt>srcu_read_unlock()</tt> contain memory barriers, which means -that SRCU readers will run a bit slower than would RCU readers. -It also motivates the <tt>smp_mb__after_srcu_read_unlock()</tt> -API, which, in combination with <tt>srcu_read_unlock()</tt>, -guarantees a full memory barrier. - -<p> -Also unlike other RCU flavors, <tt>synchronize_srcu()</tt> may <b>not</b> -be invoked from CPU-hotplug notifiers, due to the fact that SRCU grace -periods make use of timers and the possibility of timers being temporarily -“stranded” on the outgoing CPU. -This stranding of timers means that timers posted to the outgoing CPU -will not fire until late in the CPU-hotplug process. -The problem is that if a notifier is waiting on an SRCU grace period, -that grace period is waiting on a timer, and that timer is stranded on the -outgoing CPU, then the notifier will never be awakened, in other words, -deadlock has occurred. -This same situation of course also prohibits <tt>srcu_barrier()</tt> -from being invoked from CPU-hotplug notifiers. - -<p> -SRCU also differs from other RCU flavors in that SRCU's expedited and -non-expedited grace periods are implemented by the same mechanism. -This means that in the current SRCU implementation, expediting a -future grace period has the side effect of expediting all prior -grace periods that have not yet completed. -(But please note that this is a property of the current implementation, -not necessarily of future implementations.) -In addition, if SRCU has been idle for longer than the interval -specified by the <tt>srcutree.exp_holdoff</tt> kernel boot parameter -(25 microseconds by default), -and if a <tt>synchronize_srcu()</tt> invocation ends this idle period, -that invocation will be automatically expedited. - -<p> -As of v4.12, SRCU's callbacks are maintained per-CPU, eliminating -a locking bottleneck present in prior kernel versions. -Although this will allow users to put much heavier stress on -<tt>call_srcu()</tt>, it is important to note that SRCU does not -yet take any special steps to deal with callback flooding. -So if you are posting (say) 10,000 SRCU callbacks per second per CPU, -you are probably totally OK, but if you intend to post (say) 1,000,000 -SRCU callbacks per second per CPU, please run some tests first. -SRCU just might need a few adjustment to deal with that sort of load. -Of course, your mileage may vary based on the speed of your CPUs and -the size of your memory. - -<p> -The -<a href="https://lwn.net/Articles/609973/#RCU Per-Flavor API Table">SRCU API</a> -includes -<tt>srcu_read_lock()</tt>, -<tt>srcu_read_unlock()</tt>, -<tt>srcu_dereference()</tt>, -<tt>srcu_dereference_check()</tt>, -<tt>synchronize_srcu()</tt>, -<tt>synchronize_srcu_expedited()</tt>, -<tt>call_srcu()</tt>, -<tt>srcu_barrier()</tt>, and -<tt>srcu_read_lock_held()</tt>. -It also includes -<tt>DEFINE_SRCU()</tt>, -<tt>DEFINE_STATIC_SRCU()</tt>, and -<tt>init_srcu_struct()</tt> -APIs for defining and initializing <tt>srcu_struct</tt> structures. - -<h3><a name="Tasks RCU">Tasks RCU</a></h3> - -<p> -Some forms of tracing use “trampolines” to handle the -binary rewriting required to install different types of probes. -It would be good to be able to free old trampolines, which sounds -like a job for some form of RCU. -However, because it is necessary to be able to install a trace -anywhere in the code, it is not possible to use read-side markers -such as <tt>rcu_read_lock()</tt> and <tt>rcu_read_unlock()</tt>. -In addition, it does not work to have these markers in the trampoline -itself, because there would need to be instructions following -<tt>rcu_read_unlock()</tt>. -Although <tt>synchronize_rcu()</tt> would guarantee that execution -reached the <tt>rcu_read_unlock()</tt>, it would not be able to -guarantee that execution had completely left the trampoline. - -<p> -The solution, in the form of -<a href="https://lwn.net/Articles/607117/"><i>Tasks RCU</i></a>, -is to have implicit -read-side critical sections that are delimited by voluntary context -switches, that is, calls to <tt>schedule()</tt>, -<tt>cond_resched()</tt>, and -<tt>synchronize_rcu_tasks()</tt>. -In addition, transitions to and from userspace execution also delimit -tasks-RCU read-side critical sections. - -<p> -The tasks-RCU API is quite compact, consisting only of -<tt>call_rcu_tasks()</tt>, -<tt>synchronize_rcu_tasks()</tt>, and -<tt>rcu_barrier_tasks()</tt>. -In <tt>CONFIG_PREEMPT=n</tt> kernels, trampolines cannot be preempted, -so these APIs map to -<tt>call_rcu()</tt>, -<tt>synchronize_rcu()</tt>, and -<tt>rcu_barrier()</tt>, respectively. -In <tt>CONFIG_PREEMPT=y</tt> kernels, trampolines can be preempted, -and these three APIs are therefore implemented by separate functions -that check for voluntary context switches. - -<h2><a name="Possible Future Changes">Possible Future Changes</a></h2> - -<p> -One of the tricks that RCU uses to attain update-side scalability is -to increase grace-period latency with increasing numbers of CPUs. -If this becomes a serious problem, it will be necessary to rework the -grace-period state machine so as to avoid the need for the additional -latency. - -<p> -RCU disables CPU hotplug in a few places, perhaps most notably in the -<tt>rcu_barrier()</tt> operations. -If there is a strong reason to use <tt>rcu_barrier()</tt> in CPU-hotplug -notifiers, it will be necessary to avoid disabling CPU hotplug. -This would introduce some complexity, so there had better be a <i>very</i> -good reason. - -<p> -The tradeoff between grace-period latency on the one hand and interruptions -of other CPUs on the other hand may need to be re-examined. -The desire is of course for zero grace-period latency as well as zero -interprocessor interrupts undertaken during an expedited grace period -operation. -While this ideal is unlikely to be achievable, it is quite possible that -further improvements can be made. - -<p> -The multiprocessor implementations of RCU use a combining tree that -groups CPUs so as to reduce lock contention and increase cache locality. -However, this combining tree does not spread its memory across NUMA -nodes nor does it align the CPU groups with hardware features such -as sockets or cores. -Such spreading and alignment is currently believed to be unnecessary -because the hotpath read-side primitives do not access the combining -tree, nor does <tt>call_rcu()</tt> in the common case. -If you believe that your architecture needs such spreading and alignment, -then your architecture should also benefit from the -<tt>rcutree.rcu_fanout_leaf</tt> boot parameter, which can be set -to the number of CPUs in a socket, NUMA node, or whatever. -If the number of CPUs is too large, use a fraction of the number of -CPUs. -If the number of CPUs is a large prime number, well, that certainly -is an “interesting” architectural choice! -More flexible arrangements might be considered, but only if -<tt>rcutree.rcu_fanout_leaf</tt> has proven inadequate, and only -if the inadequacy has been demonstrated by a carefully run and -realistic system-level workload. - -<p> -Please note that arrangements that require RCU to remap CPU numbers will -require extremely good demonstration of need and full exploration of -alternatives. - -<p> -RCU's various kthreads are reasonably recent additions. -It is quite likely that adjustments will be required to more gracefully -handle extreme loads. -It might also be necessary to be able to relate CPU utilization by -RCU's kthreads and softirq handlers to the code that instigated this -CPU utilization. -For example, RCU callback overhead might be charged back to the -originating <tt>call_rcu()</tt> instance, though probably not -in production kernels. - -<p> -Additional work may be required to provide reasonable forward-progress -guarantees under heavy load for grace periods and for callback -invocation. - -<h2><a name="Summary">Summary</a></h2> - -<p> -This document has presented more than two decade's worth of RCU -requirements. -Given that the requirements keep changing, this will not be the last -word on this subject, but at least it serves to get an important -subset of the requirements set forth. - -<h2><a name="Acknowledgments">Acknowledgments</a></h2> - -I am grateful to Steven Rostedt, Lai Jiangshan, Ingo Molnar, -Oleg Nesterov, Borislav Petkov, Peter Zijlstra, Boqun Feng, and -Andy Lutomirski for their help in rendering -this article human readable, and to Michelle Rankin for her support -of this effort. -Other contributions are acknowledged in the Linux kernel's git archive. - -</body></html> diff --git a/Documentation/RCU/Design/Requirements/Requirements.rst b/Documentation/RCU/Design/Requirements/Requirements.rst new file mode 100644 index 000000000000..04ed8bf27a0e --- /dev/null +++ b/Documentation/RCU/Design/Requirements/Requirements.rst @@ -0,0 +1,2760 @@ +================================= +A Tour Through RCU's Requirements +================================= + +Copyright IBM Corporation, 2015 + +Author: Paul E. McKenney + +The initial version of this document appeared in the +`LWN <https://lwn.net/>`_ on those articles: +`part 1 <https://lwn.net/Articles/652156/>`_, +`part 2 <https://lwn.net/Articles/652677/>`_, and +`part 3 <https://lwn.net/Articles/653326/>`_. + +Introduction +------------ + +Read-copy update (RCU) is a synchronization mechanism that is often used +as a replacement for reader-writer locking. RCU is unusual in that +updaters do not block readers, which means that RCU's read-side +primitives can be exceedingly fast and scalable. In addition, updaters +can make useful forward progress concurrently with readers. However, all +this concurrency between RCU readers and updaters does raise the +question of exactly what RCU readers are doing, which in turn raises the +question of exactly what RCU's requirements are. + +This document therefore summarizes RCU's requirements, and can be +thought of as an informal, high-level specification for RCU. It is +important to understand that RCU's specification is primarily empirical +in nature; in fact, I learned about many of these requirements the hard +way. This situation might cause some consternation, however, not only +has this learning process been a lot of fun, but it has also been a +great privilege to work with so many people willing to apply +technologies in interesting new ways. + +All that aside, here are the categories of currently known RCU +requirements: + +#. `Fundamental Requirements`_ +#. `Fundamental Non-Requirements`_ +#. `Parallelism Facts of Life`_ +#. `Quality-of-Implementation Requirements`_ +#. `Linux Kernel Complications`_ +#. `Software-Engineering Requirements`_ +#. `Other RCU Flavors`_ +#. `Possible Future Changes`_ + +This is followed by a summary_, however, the answers to +each quick quiz immediately follows the quiz. Select the big white space +with your mouse to see the answer. + +Fundamental Requirements +------------------------ + +RCU's fundamental requirements are the closest thing RCU has to hard +mathematical requirements. These are: + +#. `Grace-Period Guarantee`_ +#. `Publish/Subscribe Guarantee`_ +#. `Memory-Barrier Guarantees`_ +#. `RCU Primitives Guaranteed to Execute Unconditionally`_ +#. `Guaranteed Read-to-Write Upgrade`_ + +Grace-Period Guarantee +~~~~~~~~~~~~~~~~~~~~~~ + +RCU's grace-period guarantee is unusual in being premeditated: Jack +Slingwine and I had this guarantee firmly in mind when we started work +on RCU (then called “rclock”) in the early 1990s. That said, the past +two decades of experience with RCU have produced a much more detailed +understanding of this guarantee. + +RCU's grace-period guarantee allows updaters to wait for the completion +of all pre-existing RCU read-side critical sections. An RCU read-side +critical section begins with the marker rcu_read_lock() and ends +with the marker rcu_read_unlock(). These markers may be nested, and +RCU treats a nested set as one big RCU read-side critical section. +Production-quality implementations of rcu_read_lock() and +rcu_read_unlock() are extremely lightweight, and in fact have +exactly zero overhead in Linux kernels built for production use with +``CONFIG_PREEMPTION=n``. + +This guarantee allows ordering to be enforced with extremely low +overhead to readers, for example: + + :: + + 1 int x, y; + 2 + 3 void thread0(void) + 4 { + 5 rcu_read_lock(); + 6 r1 = READ_ONCE(x); + 7 r2 = READ_ONCE(y); + 8 rcu_read_unlock(); + 9 } + 10 + 11 void thread1(void) + 12 { + 13 WRITE_ONCE(x, 1); + 14 synchronize_rcu(); + 15 WRITE_ONCE(y, 1); + 16 } + +Because the synchronize_rcu() on line 14 waits for all pre-existing +readers, any instance of thread0() that loads a value of zero from +``x`` must complete before thread1() stores to ``y``, so that +instance must also load a value of zero from ``y``. Similarly, any +instance of thread0() that loads a value of one from ``y`` must have +started after the synchronize_rcu() started, and must therefore also +load a value of one from ``x``. Therefore, the outcome: + + :: + + (r1 == 0 && r2 == 1) + +cannot happen. + ++-----------------------------------------------------------------------+ +| **Quick Quiz**: | ++-----------------------------------------------------------------------+ +| Wait a minute! You said that updaters can make useful forward | +| progress concurrently with readers, but pre-existing readers will | +| block synchronize_rcu()!!! | +| Just who are you trying to fool??? | ++-----------------------------------------------------------------------+ +| **Answer**: | ++-----------------------------------------------------------------------+ +| First, if updaters do not wish to be blocked by readers, they can use | +| call_rcu() or kfree_rcu(), which will be discussed later. | +| Second, even when using synchronize_rcu(), the other update-side | +| code does run concurrently with readers, whether pre-existing or not. | ++-----------------------------------------------------------------------+ + +This scenario resembles one of the first uses of RCU in +`DYNIX/ptx <https://en.wikipedia.org/wiki/DYNIX>`__, which managed a +distributed lock manager's transition into a state suitable for handling +recovery from node failure, more or less as follows: + + :: + + 1 #define STATE_NORMAL 0 + 2 #define STATE_WANT_RECOVERY 1 + 3 #define STATE_RECOVERING 2 + 4 #define STATE_WANT_NORMAL 3 + 5 + 6 int state = STATE_NORMAL; + 7 + 8 void do_something_dlm(void) + 9 { + 10 int state_snap; + 11 + 12 rcu_read_lock(); + 13 state_snap = READ_ONCE(state); + 14 if (state_snap == STATE_NORMAL) + 15 do_something(); + 16 else + 17 do_something_carefully(); + 18 rcu_read_unlock(); + 19 } + 20 + 21 void start_recovery(void) + 22 { + 23 WRITE_ONCE(state, STATE_WANT_RECOVERY); + 24 synchronize_rcu(); + 25 WRITE_ONCE(state, STATE_RECOVERING); + 26 recovery(); + 27 WRITE_ONCE(state, STATE_WANT_NORMAL); + 28 synchronize_rcu(); + 29 WRITE_ONCE(state, STATE_NORMAL); + 30 } + +The RCU read-side critical section in do_something_dlm() works with +the synchronize_rcu() in start_recovery() to guarantee that +do_something() never runs concurrently with recovery(), but with +little or no synchronization overhead in do_something_dlm(). + ++-----------------------------------------------------------------------+ +| **Quick Quiz**: | ++-----------------------------------------------------------------------+ +| Why is the synchronize_rcu() on line 28 needed? | ++-----------------------------------------------------------------------+ +| **Answer**: | ++-----------------------------------------------------------------------+ +| Without that extra grace period, memory reordering could result in | +| do_something_dlm() executing do_something() concurrently with | +| the last bits of recovery(). | ++-----------------------------------------------------------------------+ + +In order to avoid fatal problems such as deadlocks, an RCU read-side +critical section must not contain calls to synchronize_rcu(). +Similarly, an RCU read-side critical section must not contain anything +that waits, directly or indirectly, on completion of an invocation of +synchronize_rcu(). + +Although RCU's grace-period guarantee is useful in and of itself, with +`quite a few use cases <https://lwn.net/Articles/573497/>`__, it would +be good to be able to use RCU to coordinate read-side access to linked +data structures. For this, the grace-period guarantee is not sufficient, +as can be seen in function add_gp_buggy() below. We will look at the +reader's code later, but in the meantime, just think of the reader as +locklessly picking up the ``gp`` pointer, and, if the value loaded is +non-\ ``NULL``, locklessly accessing the ``->a`` and ``->b`` fields. + + :: + + 1 bool add_gp_buggy(int a, int b) + 2 { + 3 p = kmalloc(sizeof(*p), GFP_KERNEL); + 4 if (!p) + 5 return -ENOMEM; + 6 spin_lock(&gp_lock); + 7 if (rcu_access_pointer(gp)) { + 8 spin_unlock(&gp_lock); + 9 return false; + 10 } + 11 p->a = a; + 12 p->b = a; + 13 gp = p; /* ORDERING BUG */ + 14 spin_unlock(&gp_lock); + 15 return true; + 16 } + +The problem is that both the compiler and weakly ordered CPUs are within +their rights to reorder this code as follows: + + :: + + 1 bool add_gp_buggy_optimized(int a, int b) + 2 { + 3 p = kmalloc(sizeof(*p), GFP_KERNEL); + 4 if (!p) + 5 return -ENOMEM; + 6 spin_lock(&gp_lock); + 7 if (rcu_access_pointer(gp)) { + 8 spin_unlock(&gp_lock); + 9 return false; + 10 } + 11 gp = p; /* ORDERING BUG */ + 12 p->a = a; + 13 p->b = a; + 14 spin_unlock(&gp_lock); + 15 return true; + 16 } + +If an RCU reader fetches ``gp`` just after ``add_gp_buggy_optimized`` +executes line 11, it will see garbage in the ``->a`` and ``->b`` fields. +And this is but one of many ways in which compiler and hardware +optimizations could cause trouble. Therefore, we clearly need some way +to prevent the compiler and the CPU from reordering in this manner, +which brings us to the publish-subscribe guarantee discussed in the next +section. + +Publish/Subscribe Guarantee +~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +RCU's publish-subscribe guarantee allows data to be inserted into a +linked data structure without disrupting RCU readers. The updater uses +rcu_assign_pointer() to insert the new data, and readers use +rcu_dereference() to access data, whether new or old. The following +shows an example of insertion: + + :: + + 1 bool add_gp(int a, int b) + 2 { + 3 p = kmalloc(sizeof(*p), GFP_KERNEL); + 4 if (!p) + 5 return -ENOMEM; + 6 spin_lock(&gp_lock); + 7 if (rcu_access_pointer(gp)) { + 8 spin_unlock(&gp_lock); + 9 return false; + 10 } + 11 p->a = a; + 12 p->b = a; + 13 rcu_assign_pointer(gp, p); + 14 spin_unlock(&gp_lock); + 15 return true; + 16 } + +The rcu_assign_pointer() on line 13 is conceptually equivalent to a +simple assignment statement, but also guarantees that its assignment +will happen after the two assignments in lines 11 and 12, similar to the +C11 ``memory_order_release`` store operation. It also prevents any +number of “interesting” compiler optimizations, for example, the use of +``gp`` as a scratch location immediately preceding the assignment. + ++-----------------------------------------------------------------------+ +| **Quick Quiz**: | ++-----------------------------------------------------------------------+ +| But rcu_assign_pointer() does nothing to prevent the two | +| assignments to ``p->a`` and ``p->b`` from being reordered. Can't that | +| also cause problems? | ++-----------------------------------------------------------------------+ +| **Answer**: | ++-----------------------------------------------------------------------+ +| No, it cannot. The readers cannot see either of these two fields | +| until the assignment to ``gp``, by which time both fields are fully | +| initialized. So reordering the assignments to ``p->a`` and ``p->b`` | +| cannot possibly cause any problems. | ++-----------------------------------------------------------------------+ + +It is tempting to assume that the reader need not do anything special to +control its accesses to the RCU-protected data, as shown in +do_something_gp_buggy() below: + + :: + + 1 bool do_something_gp_buggy(void) + 2 { + 3 rcu_read_lock(); + 4 p = gp; /* OPTIMIZATIONS GALORE!!! */ + 5 if (p) { + 6 do_something(p->a, p->b); + 7 rcu_read_unlock(); + 8 return true; + 9 } + 10 rcu_read_unlock(); + 11 return false; + 12 } + +However, this temptation must be resisted because there are a +surprisingly large number of ways that the compiler (or weak ordering +CPUs like the DEC Alpha) can trip this code up. For but one example, if +the compiler were short of registers, it might choose to refetch from +``gp`` rather than keeping a separate copy in ``p`` as follows: + + :: + + 1 bool do_something_gp_buggy_optimized(void) + 2 { + 3 rcu_read_lock(); + 4 if (gp) { /* OPTIMIZATIONS GALORE!!! */ + 5 do_something(gp->a, gp->b); + 6 rcu_read_unlock(); + 7 return true; + 8 } + 9 rcu_read_unlock(); + 10 return false; + 11 } + +If this function ran concurrently with a series of updates that replaced +the current structure with a new one, the fetches of ``gp->a`` and +``gp->b`` might well come from two different structures, which could +cause serious confusion. To prevent this (and much else besides), +do_something_gp() uses rcu_dereference() to fetch from ``gp``: + + :: + + 1 bool do_something_gp(void) + 2 { + 3 rcu_read_lock(); + 4 p = rcu_dereference(gp); + 5 if (p) { + 6 do_something(p->a, p->b); + 7 rcu_read_unlock(); + 8 return true; + 9 } + 10 rcu_read_unlock(); + 11 return false; + 12 } + +The rcu_dereference() uses volatile casts and (for DEC Alpha) memory +barriers in the Linux kernel. Should a |high-quality implementation of +C11 memory_order_consume [PDF]|_ +ever appear, then rcu_dereference() could be implemented as a +``memory_order_consume`` load. Regardless of the exact implementation, a +pointer fetched by rcu_dereference() may not be used outside of the +outermost RCU read-side critical section containing that +rcu_dereference(), unless protection of the corresponding data +element has been passed from RCU to some other synchronization +mechanism, most commonly locking or reference counting +(see ../../rcuref.rst). + +.. |high-quality implementation of C11 memory_order_consume [PDF]| replace:: high-quality implementation of C11 ``memory_order_consume`` [PDF] +.. _high-quality implementation of C11 memory_order_consume [PDF]: http://www.rdrop.com/users/paulmck/RCU/consume.2015.07.13a.pdf + +In short, updaters use rcu_assign_pointer() and readers use +rcu_dereference(), and these two RCU API elements work together to +ensure that readers have a consistent view of newly added data elements. + +Of course, it is also necessary to remove elements from RCU-protected +data structures, for example, using the following process: + +#. Remove the data element from the enclosing structure. +#. Wait for all pre-existing RCU read-side critical sections to complete + (because only pre-existing readers can possibly have a reference to + the newly removed data element). +#. At this point, only the updater has a reference to the newly removed + data element, so it can safely reclaim the data element, for example, + by passing it to kfree(). + +This process is implemented by remove_gp_synchronous(): + + :: + + 1 bool remove_gp_synchronous(void) + 2 { + 3 struct foo *p; + 4 + 5 spin_lock(&gp_lock); + 6 p = rcu_access_pointer(gp); + 7 if (!p) { + 8 spin_unlock(&gp_lock); + 9 return false; + 10 } + 11 rcu_assign_pointer(gp, NULL); + 12 spin_unlock(&gp_lock); + 13 synchronize_rcu(); + 14 kfree(p); + 15 return true; + 16 } + +This function is straightforward, with line 13 waiting for a grace +period before line 14 frees the old data element. This waiting ensures +that readers will reach line 7 of do_something_gp() before the data +element referenced by ``p`` is freed. The rcu_access_pointer() on +line 6 is similar to rcu_dereference(), except that: + +#. The value returned by rcu_access_pointer() cannot be + dereferenced. If you want to access the value pointed to as well as + the pointer itself, use rcu_dereference() instead of + rcu_access_pointer(). +#. The call to rcu_access_pointer() need not be protected. In + contrast, rcu_dereference() must either be within an RCU + read-side critical section or in a code segment where the pointer + cannot change, for example, in code protected by the corresponding + update-side lock. + ++-----------------------------------------------------------------------+ +| **Quick Quiz**: | ++-----------------------------------------------------------------------+ +| Without the rcu_dereference() or the rcu_access_pointer(), | +| what destructive optimizations might the compiler make use of? | ++-----------------------------------------------------------------------+ +| **Answer**: | ++-----------------------------------------------------------------------+ +| Let's start with what happens to do_something_gp() if it fails to | +| use rcu_dereference(). It could reuse a value formerly fetched | +| from this same pointer. It could also fetch the pointer from ``gp`` | +| in a byte-at-a-time manner, resulting in *load tearing*, in turn | +| resulting a bytewise mash-up of two distinct pointer values. It might | +| even use value-speculation optimizations, where it makes a wrong | +| guess, but by the time it gets around to checking the value, an | +| update has changed the pointer to match the wrong guess. Too bad | +| about any dereferences that returned pre-initialization garbage in | +| the meantime! | +| For remove_gp_synchronous(), as long as all modifications to | +| ``gp`` are carried out while holding ``gp_lock``, the above | +| optimizations are harmless. However, ``sparse`` will complain if you | +| define ``gp`` with ``__rcu`` and then access it without using either | +| rcu_access_pointer() or rcu_dereference(). | ++-----------------------------------------------------------------------+ + +In short, RCU's publish-subscribe guarantee is provided by the +combination of rcu_assign_pointer() and rcu_dereference(). This +guarantee allows data elements to be safely added to RCU-protected +linked data structures without disrupting RCU readers. This guarantee +can be used in combination with the grace-period guarantee to also allow +data elements to be removed from RCU-protected linked data structures, +again without disrupting RCU readers. + +This guarantee was only partially premeditated. DYNIX/ptx used an +explicit memory barrier for publication, but had nothing resembling +rcu_dereference() for subscription, nor did it have anything +resembling the dependency-ordering barrier that was later subsumed +into rcu_dereference() and later still into READ_ONCE(). The +need for these operations made itself known quite suddenly at a +late-1990s meeting with the DEC Alpha architects, back in the days when +DEC was still a free-standing company. It took the Alpha architects a +good hour to convince me that any sort of barrier would ever be needed, +and it then took me a good *two* hours to convince them that their +documentation did not make this point clear. More recent work with the C +and C++ standards committees have provided much education on tricks and +traps from the compiler. In short, compilers were much less tricky in +the early 1990s, but in 2015, don't even think about omitting +rcu_dereference()! + +Memory-Barrier Guarantees +~~~~~~~~~~~~~~~~~~~~~~~~~ + +The previous section's simple linked-data-structure scenario clearly +demonstrates the need for RCU's stringent memory-ordering guarantees on +systems with more than one CPU: + +#. Each CPU that has an RCU read-side critical section that begins + before synchronize_rcu() starts is guaranteed to execute a full + memory barrier between the time that the RCU read-side critical + section ends and the time that synchronize_rcu() returns. Without + this guarantee, a pre-existing RCU read-side critical section might + hold a reference to the newly removed ``struct foo`` after the + kfree() on line 14 of remove_gp_synchronous(). +#. Each CPU that has an RCU read-side critical section that ends after + synchronize_rcu() returns is guaranteed to execute a full memory + barrier between the time that synchronize_rcu() begins and the + time that the RCU read-side critical section begins. Without this + guarantee, a later RCU read-side critical section running after the + kfree() on line 14 of remove_gp_synchronous() might later run + do_something_gp() and find the newly deleted ``struct foo``. +#. If the task invoking synchronize_rcu() remains on a given CPU, + then that CPU is guaranteed to execute a full memory barrier sometime + during the execution of synchronize_rcu(). This guarantee ensures + that the kfree() on line 14 of remove_gp_synchronous() really + does execute after the removal on line 11. +#. If the task invoking synchronize_rcu() migrates among a group of + CPUs during that invocation, then each of the CPUs in that group is + guaranteed to execute a full memory barrier sometime during the + execution of synchronize_rcu(). This guarantee also ensures that + the kfree() on line 14 of remove_gp_synchronous() really does + execute after the removal on line 11, but also in the case where the + thread executing the synchronize_rcu() migrates in the meantime. + ++-----------------------------------------------------------------------+ +| **Quick Quiz**: | ++-----------------------------------------------------------------------+ +| Given that multiple CPUs can start RCU read-side critical sections at | +| any time without any ordering whatsoever, how can RCU possibly tell | +| whether or not a given RCU read-side critical section starts before a | +| given instance of synchronize_rcu()? | ++-----------------------------------------------------------------------+ +| **Answer**: | ++-----------------------------------------------------------------------+ +| If RCU cannot tell whether or not a given RCU read-side critical | +| section starts before a given instance of synchronize_rcu(), then | +| it must assume that the RCU read-side critical section started first. | +| In other words, a given instance of synchronize_rcu() can avoid | +| waiting on a given RCU read-side critical section only if it can | +| prove that synchronize_rcu() started first. | +| A related question is “When rcu_read_lock() doesn't generate any | +| code, why does it matter how it relates to a grace period?” The | +| answer is that it is not the relationship of rcu_read_lock() | +| itself that is important, but rather the relationship of the code | +| within the enclosed RCU read-side critical section to the code | +| preceding and following the grace period. If we take this viewpoint, | +| then a given RCU read-side critical section begins before a given | +| grace period when some access preceding the grace period observes the | +| effect of some access within the critical section, in which case none | +| of the accesses within the critical section may observe the effects | +| of any access following the grace period. | +| | +| As of late 2016, mathematical models of RCU take this viewpoint, for | +| example, see slides 62 and 63 of the `2016 LinuxCon | +| EU <http://www2.rdrop.com/users/paulmck/scalability/paper/LinuxMM.201 | +| 6.10.04c.LCE.pdf>`__ | +| presentation. | ++-----------------------------------------------------------------------+ + ++-----------------------------------------------------------------------+ +| **Quick Quiz**: | ++-----------------------------------------------------------------------+ +| The first and second guarantees require unbelievably strict ordering! | +| Are all these memory barriers *really* required? | ++-----------------------------------------------------------------------+ +| **Answer**: | ++-----------------------------------------------------------------------+ +| Yes, they really are required. To see why the first guarantee is | +| required, consider the following sequence of events: | +| | +| #. CPU 1: rcu_read_lock() | +| #. CPU 1: ``q = rcu_dereference(gp); /* Very likely to return p. */`` | +| #. CPU 0: ``list_del_rcu(p);`` | +| #. CPU 0: synchronize_rcu() starts. | +| #. CPU 1: ``do_something_with(q->a);`` | +| ``/* No smp_mb(), so might happen after kfree(). */`` | +| #. CPU 1: rcu_read_unlock() | +| #. CPU 0: synchronize_rcu() returns. | +| #. CPU 0: ``kfree(p);`` | +| | +| Therefore, there absolutely must be a full memory barrier between the | +| end of the RCU read-side critical section and the end of the grace | +| period. | +| | +| The sequence of events demonstrating the necessity of the second rule | +| is roughly similar: | +| | +| #. CPU 0: ``list_del_rcu(p);`` | +| #. CPU 0: synchronize_rcu() starts. | +| #. CPU 1: rcu_read_lock() | +| #. CPU 1: ``q = rcu_dereference(gp);`` | +| ``/* Might return p if no memory barrier. */`` | +| #. CPU 0: synchronize_rcu() returns. | +| #. CPU 0: ``kfree(p);`` | +| #. CPU 1: ``do_something_with(q->a); /* Boom!!! */`` | +| #. CPU 1: rcu_read_unlock() | +| | +| And similarly, without a memory barrier between the beginning of the | +| grace period and the beginning of the RCU read-side critical section, | +| CPU 1 might end up accessing the freelist. | +| | +| The “as if” rule of course applies, so that any implementation that | +| acts as if the appropriate memory barriers were in place is a correct | +| implementation. That said, it is much easier to fool yourself into | +| believing that you have adhered to the as-if rule than it is to | +| actually adhere to it! | ++-----------------------------------------------------------------------+ + ++-----------------------------------------------------------------------+ +| **Quick Quiz**: | ++-----------------------------------------------------------------------+ +| You claim that rcu_read_lock() and rcu_read_unlock() generate | +| absolutely no code in some kernel builds. This means that the | +| compiler might arbitrarily rearrange consecutive RCU read-side | +| critical sections. Given such rearrangement, if a given RCU read-side | +| critical section is done, how can you be sure that all prior RCU | +| read-side critical sections are done? Won't the compiler | +| rearrangements make that impossible to determine? | ++-----------------------------------------------------------------------+ +| **Answer**: | ++-----------------------------------------------------------------------+ +| In cases where rcu_read_lock() and rcu_read_unlock() generate | +| absolutely no code, RCU infers quiescent states only at special | +| locations, for example, within the scheduler. Because calls to | +| schedule() had better prevent calling-code accesses to shared | +| variables from being rearranged across the call to schedule(), if | +| RCU detects the end of a given RCU read-side critical section, it | +| will necessarily detect the end of all prior RCU read-side critical | +| sections, no matter how aggressively the compiler scrambles the code. | +| Again, this all assumes that the compiler cannot scramble code across | +| calls to the scheduler, out of interrupt handlers, into the idle | +| loop, into user-mode code, and so on. But if your kernel build allows | +| that sort of scrambling, you have broken far more than just RCU! | ++-----------------------------------------------------------------------+ + +Note that these memory-barrier requirements do not replace the +fundamental RCU requirement that a grace period wait for all +pre-existing readers. On the contrary, the memory barriers called out in +this section must operate in such a way as to *enforce* this fundamental +requirement. Of course, different implementations enforce this +requirement in different ways, but enforce it they must. + +RCU Primitives Guaranteed to Execute Unconditionally +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The common-case RCU primitives are unconditional. They are invoked, they +do their job, and they return, with no possibility of error, and no need +to retry. This is a key RCU design philosophy. + +However, this philosophy is pragmatic rather than pigheaded. If someone +comes up with a good justification for a particular conditional RCU +primitive, it might well be implemented and added. After all, this +guarantee was reverse-engineered, not premeditated. The unconditional +nature of the RCU primitives was initially an accident of +implementation, and later experience with synchronization primitives +with conditional primitives caused me to elevate this accident to a +guarantee. Therefore, the justification for adding a conditional +primitive to RCU would need to be based on detailed and compelling use +cases. + +Guaranteed Read-to-Write Upgrade +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +As far as RCU is concerned, it is always possible to carry out an update +within an RCU read-side critical section. For example, that RCU +read-side critical section might search for a given data element, and +then might acquire the update-side spinlock in order to update that +element, all while remaining in that RCU read-side critical section. Of +course, it is necessary to exit the RCU read-side critical section +before invoking synchronize_rcu(), however, this inconvenience can +be avoided through use of the call_rcu() and kfree_rcu() API +members described later in this document. + ++-----------------------------------------------------------------------+ +| **Quick Quiz**: | ++-----------------------------------------------------------------------+ +| But how does the upgrade-to-write operation exclude other readers? | ++-----------------------------------------------------------------------+ +| **Answer**: | ++-----------------------------------------------------------------------+ +| It doesn't, just like normal RCU updates, which also do not exclude | +| RCU readers. | ++-----------------------------------------------------------------------+ + +This guarantee allows lookup code to be shared between read-side and +update-side code, and was premeditated, appearing in the earliest +DYNIX/ptx RCU documentation. + +Fundamental Non-Requirements +---------------------------- + +RCU provides extremely lightweight readers, and its read-side +guarantees, though quite useful, are correspondingly lightweight. It is +therefore all too easy to assume that RCU is guaranteeing more than it +really is. Of course, the list of things that RCU does not guarantee is +infinitely long, however, the following sections list a few +non-guarantees that have caused confusion. Except where otherwise noted, +these non-guarantees were premeditated. + +#. `Readers Impose Minimal Ordering`_ +#. `Readers Do Not Exclude Updaters`_ +#. `Updaters Only Wait For Old Readers`_ +#. `Grace Periods Don't Partition Read-Side Critical Sections`_ +#. `Read-Side Critical Sections Don't Partition Grace Periods`_ + +Readers Impose Minimal Ordering +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Reader-side markers such as rcu_read_lock() and +rcu_read_unlock() provide absolutely no ordering guarantees except +through their interaction with the grace-period APIs such as +synchronize_rcu(). To see this, consider the following pair of +threads: + + :: + + 1 void thread0(void) + 2 { + 3 rcu_read_lock(); + 4 WRITE_ONCE(x, 1); + 5 rcu_read_unlock(); + 6 rcu_read_lock(); + 7 WRITE_ONCE(y, 1); + 8 rcu_read_unlock(); + 9 } + 10 + 11 void thread1(void) + 12 { + 13 rcu_read_lock(); + 14 r1 = READ_ONCE(y); + 15 rcu_read_unlock(); + 16 rcu_read_lock(); + 17 r2 = READ_ONCE(x); + 18 rcu_read_unlock(); + 19 } + +After thread0() and thread1() execute concurrently, it is quite +possible to have + + :: + + (r1 == 1 && r2 == 0) + +(that is, ``y`` appears to have been assigned before ``x``), which would +not be possible if rcu_read_lock() and rcu_read_unlock() had +much in the way of ordering properties. But they do not, so the CPU is +within its rights to do significant reordering. This is by design: Any +significant ordering constraints would slow down these fast-path APIs. + ++-----------------------------------------------------------------------+ +| **Quick Quiz**: | ++-----------------------------------------------------------------------+ +| Can't the compiler also reorder this code? | ++-----------------------------------------------------------------------+ +| **Answer**: | ++-----------------------------------------------------------------------+ +| No, the volatile casts in READ_ONCE() and WRITE_ONCE() | +| prevent the compiler from reordering in this particular case. | ++-----------------------------------------------------------------------+ + +Readers Do Not Exclude Updaters +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Neither rcu_read_lock() nor rcu_read_unlock() exclude updates. +All they do is to prevent grace periods from ending. The following +example illustrates this: + + :: + + 1 void thread0(void) + 2 { + 3 rcu_read_lock(); + 4 r1 = READ_ONCE(y); + 5 if (r1) { + 6 do_something_with_nonzero_x(); + 7 r2 = READ_ONCE(x); + 8 WARN_ON(!r2); /* BUG!!! */ + 9 } + 10 rcu_read_unlock(); + 11 } + 12 + 13 void thread1(void) + 14 { + 15 spin_lock(&my_lock); + 16 WRITE_ONCE(x, 1); + 17 WRITE_ONCE(y, 1); + 18 spin_unlock(&my_lock); + 19 } + +If the thread0() function's rcu_read_lock() excluded the +thread1() function's update, the WARN_ON() could never fire. But +the fact is that rcu_read_lock() does not exclude much of anything +aside from subsequent grace periods, of which thread1() has none, so +the WARN_ON() can and does fire. + +Updaters Only Wait For Old Readers +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +It might be tempting to assume that after synchronize_rcu() +completes, there are no readers executing. This temptation must be +avoided because new readers can start immediately after +synchronize_rcu() starts, and synchronize_rcu() is under no +obligation to wait for these new readers. + ++-----------------------------------------------------------------------+ +| **Quick Quiz**: | ++-----------------------------------------------------------------------+ +| Suppose that synchronize_rcu() did wait until *all* readers had | +| completed instead of waiting only on pre-existing readers. For how | +| long would the updater be able to rely on there being no readers? | ++-----------------------------------------------------------------------+ +| **Answer**: | ++-----------------------------------------------------------------------+ +| For no time at all. Even if synchronize_rcu() were to wait until | +| all readers had completed, a new reader might start immediately after | +| synchronize_rcu() completed. Therefore, the code following | +| synchronize_rcu() can *never* rely on there being no readers. | ++-----------------------------------------------------------------------+ + +Grace Periods Don't Partition Read-Side Critical Sections +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +It is tempting to assume that if any part of one RCU read-side critical +section precedes a given grace period, and if any part of another RCU +read-side critical section follows that same grace period, then all of +the first RCU read-side critical section must precede all of the second. +However, this just isn't the case: A single grace period does not +partition the set of RCU read-side critical sections. An example of this +situation can be illustrated as follows, where ``x``, ``y``, and ``z`` +are initially all zero: + + :: + + 1 void thread0(void) + 2 { + 3 rcu_read_lock(); + 4 WRITE_ONCE(a, 1); + 5 WRITE_ONCE(b, 1); + 6 rcu_read_unlock(); + 7 } + 8 + 9 void thread1(void) + 10 { + 11 r1 = READ_ONCE(a); + 12 synchronize_rcu(); + 13 WRITE_ONCE(c, 1); + 14 } + 15 + 16 void thread2(void) + 17 { + 18 rcu_read_lock(); + 19 r2 = READ_ONCE(b); + 20 r3 = READ_ONCE(c); + 21 rcu_read_unlock(); + 22 } + +It turns out that the outcome: + + :: + + (r1 == 1 && r2 == 0 && r3 == 1) + +is entirely possible. The following figure show how this can happen, +with each circled ``QS`` indicating the point at which RCU recorded a +*quiescent state* for each thread, that is, a state in which RCU knows +that the thread cannot be in the midst of an RCU read-side critical +section that started before the current grace period: + +.. kernel-figure:: GPpartitionReaders1.svg + +If it is necessary to partition RCU read-side critical sections in this +manner, it is necessary to use two grace periods, where the first grace +period is known to end before the second grace period starts: + + :: + + 1 void thread0(void) + 2 { + 3 rcu_read_lock(); + 4 WRITE_ONCE(a, 1); + 5 WRITE_ONCE(b, 1); + 6 rcu_read_unlock(); + 7 } + 8 + 9 void thread1(void) + 10 { + 11 r1 = READ_ONCE(a); + 12 synchronize_rcu(); + 13 WRITE_ONCE(c, 1); + 14 } + 15 + 16 void thread2(void) + 17 { + 18 r2 = READ_ONCE(c); + 19 synchronize_rcu(); + 20 WRITE_ONCE(d, 1); + 21 } + 22 + 23 void thread3(void) + 24 { + 25 rcu_read_lock(); + 26 r3 = READ_ONCE(b); + 27 r4 = READ_ONCE(d); + 28 rcu_read_unlock(); + 29 } + +Here, if ``(r1 == 1)``, then thread0()'s write to ``b`` must happen +before the end of thread1()'s grace period. If in addition +``(r4 == 1)``, then thread3()'s read from ``b`` must happen after +the beginning of thread2()'s grace period. If it is also the case +that ``(r2 == 1)``, then the end of thread1()'s grace period must +precede the beginning of thread2()'s grace period. This mean that +the two RCU read-side critical sections cannot overlap, guaranteeing +that ``(r3 == 1)``. As a result, the outcome: + + :: + + (r1 == 1 && r2 == 1 && r3 == 0 && r4 == 1) + +cannot happen. + +This non-requirement was also non-premeditated, but became apparent when +studying RCU's interaction with memory ordering. + +Read-Side Critical Sections Don't Partition Grace Periods +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +It is also tempting to assume that if an RCU read-side critical section +happens between a pair of grace periods, then those grace periods cannot +overlap. However, this temptation leads nowhere good, as can be +illustrated by the following, with all variables initially zero: + + :: + + 1 void thread0(void) + 2 { + 3 rcu_read_lock(); + 4 WRITE_ONCE(a, 1); + 5 WRITE_ONCE(b, 1); + 6 rcu_read_unlock(); + 7 } + 8 + 9 void thread1(void) + 10 { + 11 r1 = READ_ONCE(a); + 12 synchronize_rcu(); + 13 WRITE_ONCE(c, 1); + 14 } + 15 + 16 void thread2(void) + 17 { + 18 rcu_read_lock(); + 19 WRITE_ONCE(d, 1); + 20 r2 = READ_ONCE(c); + 21 rcu_read_unlock(); + 22 } + 23 + 24 void thread3(void) + 25 { + 26 r3 = READ_ONCE(d); + 27 synchronize_rcu(); + 28 WRITE_ONCE(e, 1); + 29 } + 30 + 31 void thread4(void) + 32 { + 33 rcu_read_lock(); + 34 r4 = READ_ONCE(b); + 35 r5 = READ_ONCE(e); + 36 rcu_read_unlock(); + 37 } + +In this case, the outcome: + + :: + + (r1 == 1 && r2 == 1 && r3 == 1 && r4 == 0 && r5 == 1) + +is entirely possible, as illustrated below: + +.. kernel-figure:: ReadersPartitionGP1.svg + +Again, an RCU read-side critical section can overlap almost all of a +given grace period, just so long as it does not overlap the entire grace +period. As a result, an RCU read-side critical section cannot partition +a pair of RCU grace periods. + ++-----------------------------------------------------------------------+ +| **Quick Quiz**: | ++-----------------------------------------------------------------------+ +| How long a sequence of grace periods, each separated by an RCU | +| read-side critical section, would be required to partition the RCU | +| read-side critical sections at the beginning and end of the chain? | ++-----------------------------------------------------------------------+ +| **Answer**: | ++-----------------------------------------------------------------------+ +| In theory, an infinite number. In practice, an unknown number that is | +| sensitive to both implementation details and timing considerations. | +| Therefore, even in practice, RCU users must abide by the theoretical | +| rather than the practical answer. | ++-----------------------------------------------------------------------+ + +Parallelism Facts of Life +------------------------- + +These parallelism facts of life are by no means specific to RCU, but the +RCU implementation must abide by them. They therefore bear repeating: + +#. Any CPU or task may be delayed at any time, and any attempts to avoid + these delays by disabling preemption, interrupts, or whatever are + completely futile. This is most obvious in preemptible user-level + environments and in virtualized environments (where a given guest + OS's VCPUs can be preempted at any time by the underlying + hypervisor), but can also happen in bare-metal environments due to + ECC errors, NMIs, and other hardware events. Although a delay of more + than about 20 seconds can result in splats, the RCU implementation is + obligated to use algorithms that can tolerate extremely long delays, + but where “extremely long” is not long enough to allow wrap-around + when incrementing a 64-bit counter. +#. Both the compiler and the CPU can reorder memory accesses. Where it + matters, RCU must use compiler directives and memory-barrier + instructions to preserve ordering. +#. Conflicting writes to memory locations in any given cache line will + result in expensive cache misses. Greater numbers of concurrent + writes and more-frequent concurrent writes will result in more + dramatic slowdowns. RCU is therefore obligated to use algorithms that + have sufficient locality to avoid significant performance and + scalability problems. +#. As a rough rule of thumb, only one CPU's worth of processing may be + carried out under the protection of any given exclusive lock. RCU + must therefore use scalable locking designs. +#. Counters are finite, especially on 32-bit systems. RCU's use of + counters must therefore tolerate counter wrap, or be designed such + that counter wrap would take way more time than a single system is + likely to run. An uptime of ten years is quite possible, a runtime of + a century much less so. As an example of the latter, RCU's + dyntick-idle nesting counter allows 54 bits for interrupt nesting + level (this counter is 64 bits even on a 32-bit system). Overflowing + this counter requires 2\ :sup:`54` half-interrupts on a given CPU + without that CPU ever going idle. If a half-interrupt happened every + microsecond, it would take 570 years of runtime to overflow this + counter, which is currently believed to be an acceptably long time. +#. Linux systems can have thousands of CPUs running a single Linux + kernel in a single shared-memory environment. RCU must therefore pay + close attention to high-end scalability. + +This last parallelism fact of life means that RCU must pay special +attention to the preceding facts of life. The idea that Linux might +scale to systems with thousands of CPUs would have been met with some +skepticism in the 1990s, but these requirements would have otherwise +have been unsurprising, even in the early 1990s. + +Quality-of-Implementation Requirements +-------------------------------------- + +These sections list quality-of-implementation requirements. Although an +RCU implementation that ignores these requirements could still be used, +it would likely be subject to limitations that would make it +inappropriate for industrial-strength production use. Classes of +quality-of-implementation requirements are as follows: + +#. `Specialization`_ +#. `Performance and Scalability`_ +#. `Forward Progress`_ +#. `Composability`_ +#. `Corner Cases`_ + +These classes is covered in the following sections. + +Specialization +~~~~~~~~~~~~~~ + +RCU is and always has been intended primarily for read-mostly +situations, which means that RCU's read-side primitives are optimized, +often at the expense of its update-side primitives. Experience thus far +is captured by the following list of situations: + +#. Read-mostly data, where stale and inconsistent data is not a problem: + RCU works great! +#. Read-mostly data, where data must be consistent: RCU works well. +#. Read-write data, where data must be consistent: RCU *might* work OK. + Or not. +#. Write-mostly data, where data must be consistent: RCU is very + unlikely to be the right tool for the job, with the following + exceptions, where RCU can provide: + + a. Existence guarantees for update-friendly mechanisms. + b. Wait-free read-side primitives for real-time use. + +This focus on read-mostly situations means that RCU must interoperate +with other synchronization primitives. For example, the add_gp() and +remove_gp_synchronous() examples discussed earlier use RCU to +protect readers and locking to coordinate updaters. However, the need +extends much farther, requiring that a variety of synchronization +primitives be legal within RCU read-side critical sections, including +spinlocks, sequence locks, atomic operations, reference counters, and +memory barriers. + ++-----------------------------------------------------------------------+ +| **Quick Quiz**: | ++-----------------------------------------------------------------------+ +| What about sleeping locks? | ++-----------------------------------------------------------------------+ +| **Answer**: | ++-----------------------------------------------------------------------+ +| These are forbidden within Linux-kernel RCU read-side critical | +| sections because it is not legal to place a quiescent state (in this | +| case, voluntary context switch) within an RCU read-side critical | +| section. However, sleeping locks may be used within userspace RCU | +| read-side critical sections, and also within Linux-kernel sleepable | +| RCU `(SRCU) <Sleepable RCU_>`__ read-side critical sections. In | +| addition, the -rt patchset turns spinlocks into a sleeping locks so | +| that the corresponding critical sections can be preempted, which also | +| means that these sleeplockified spinlocks (but not other sleeping | +| locks!) may be acquire within -rt-Linux-kernel RCU read-side critical | +| sections. | +| Note that it *is* legal for a normal RCU read-side critical section | +| to conditionally acquire a sleeping locks (as in | +| mutex_trylock()), but only as long as it does not loop | +| indefinitely attempting to conditionally acquire that sleeping locks. | +| The key point is that things like mutex_trylock() either return | +| with the mutex held, or return an error indication if the mutex was | +| not immediately available. Either way, mutex_trylock() returns | +| immediately without sleeping. | ++-----------------------------------------------------------------------+ + +It often comes as a surprise that many algorithms do not require a +consistent view of data, but many can function in that mode, with +network routing being the poster child. Internet routing algorithms take +significant time to propagate updates, so that by the time an update +arrives at a given system, that system has been sending network traffic +the wrong way for a considerable length of time. Having a few threads +continue to send traffic the wrong way for a few more milliseconds is +clearly not a problem: In the worst case, TCP retransmissions will +eventually get the data where it needs to go. In general, when tracking +the state of the universe outside of the computer, some level of +inconsistency must be tolerated due to speed-of-light delays if nothing +else. + +Furthermore, uncertainty about external state is inherent in many cases. +For example, a pair of veterinarians might use heartbeat to determine +whether or not a given cat was alive. But how long should they wait +after the last heartbeat to decide that the cat is in fact dead? Waiting +less than 400 milliseconds makes no sense because this would mean that a +relaxed cat would be considered to cycle between death and life more +than 100 times per minute. Moreover, just as with human beings, a cat's +heart might stop for some period of time, so the exact wait period is a +judgment call. One of our pair of veterinarians might wait 30 seconds +before pronouncing the cat dead, while the other might insist on waiting +a full minute. The two veterinarians would then disagree on the state of +the cat during the final 30 seconds of the minute following the last +heartbeat. + +Interestingly enough, this same situation applies to hardware. When push +comes to shove, how do we tell whether or not some external server has +failed? We send messages to it periodically, and declare it failed if we +don't receive a response within a given period of time. Policy decisions +can usually tolerate short periods of inconsistency. The policy was +decided some time ago, and is only now being put into effect, so a few +milliseconds of delay is normally inconsequential. + +However, there are algorithms that absolutely must see consistent data. +For example, the translation between a user-level SystemV semaphore ID +to the corresponding in-kernel data structure is protected by RCU, but +it is absolutely forbidden to update a semaphore that has just been +removed. In the Linux kernel, this need for consistency is accommodated +by acquiring spinlocks located in the in-kernel data structure from +within the RCU read-side critical section, and this is indicated by the +green box in the figure above. Many other techniques may be used, and +are in fact used within the Linux kernel. + +In short, RCU is not required to maintain consistency, and other +mechanisms may be used in concert with RCU when consistency is required. +RCU's specialization allows it to do its job extremely well, and its +ability to interoperate with other synchronization mechanisms allows the +right mix of synchronization tools to be used for a given job. + +Performance and Scalability +~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Energy efficiency is a critical component of performance today, and +Linux-kernel RCU implementations must therefore avoid unnecessarily +awakening idle CPUs. I cannot claim that this requirement was +premeditated. In fact, I learned of it during a telephone conversation +in which I was given “frank and open” feedback on the importance of +energy efficiency in battery-powered systems and on specific +energy-efficiency shortcomings of the Linux-kernel RCU implementation. +In my experience, the battery-powered embedded community will consider +any unnecessary wakeups to be extremely unfriendly acts. So much so that +mere Linux-kernel-mailing-list posts are insufficient to vent their ire. + +Memory consumption is not particularly important for in most situations, +and has become decreasingly so as memory sizes have expanded and memory +costs have plummeted. However, as I learned from Matt Mackall's +`bloatwatch <http://elinux.org/Linux_Tiny-FAQ>`__ efforts, memory +footprint is critically important on single-CPU systems with +non-preemptible (``CONFIG_PREEMPTION=n``) kernels, and thus `tiny +RCU <https://lore.kernel.org/r/20090113221724.GA15307@linux.vnet.ibm.com>`__ +was born. Josh Triplett has since taken over the small-memory banner +with his `Linux kernel tinification <https://tiny.wiki.kernel.org/>`__ +project, which resulted in `SRCU <Sleepable RCU_>`__ becoming optional +for those kernels not needing it. + +The remaining performance requirements are, for the most part, +unsurprising. For example, in keeping with RCU's read-side +specialization, rcu_dereference() should have negligible overhead +(for example, suppression of a few minor compiler optimizations). +Similarly, in non-preemptible environments, rcu_read_lock() and +rcu_read_unlock() should have exactly zero overhead. + +In preemptible environments, in the case where the RCU read-side +critical section was not preempted (as will be the case for the +highest-priority real-time process), rcu_read_lock() and +rcu_read_unlock() should have minimal overhead. In particular, they +should not contain atomic read-modify-write operations, memory-barrier +instructions, preemption disabling, interrupt disabling, or backwards +branches. However, in the case where the RCU read-side critical section +was preempted, rcu_read_unlock() may acquire spinlocks and disable +interrupts. This is why it is better to nest an RCU read-side critical +section within a preempt-disable region than vice versa, at least in +cases where that critical section is short enough to avoid unduly +degrading real-time latencies. + +The synchronize_rcu() grace-period-wait primitive is optimized for +throughput. It may therefore incur several milliseconds of latency in +addition to the duration of the longest RCU read-side critical section. +On the other hand, multiple concurrent invocations of +synchronize_rcu() are required to use batching optimizations so that +they can be satisfied by a single underlying grace-period-wait +operation. For example, in the Linux kernel, it is not unusual for a +single grace-period-wait operation to serve more than `1,000 separate +invocations <https://www.usenix.org/conference/2004-usenix-annual-technical-conference/making-rcu-safe-deep-sub-millisecond-response>`__ +of synchronize_rcu(), thus amortizing the per-invocation overhead +down to nearly zero. However, the grace-period optimization is also +required to avoid measurable degradation of real-time scheduling and +interrupt latencies. + +In some cases, the multi-millisecond synchronize_rcu() latencies are +unacceptable. In these cases, synchronize_rcu_expedited() may be +used instead, reducing the grace-period latency down to a few tens of +microseconds on small systems, at least in cases where the RCU read-side +critical sections are short. There are currently no special latency +requirements for synchronize_rcu_expedited() on large systems, but, +consistent with the empirical nature of the RCU specification, that is +subject to change. However, there most definitely are scalability +requirements: A storm of synchronize_rcu_expedited() invocations on +4096 CPUs should at least make reasonable forward progress. In return +for its shorter latencies, synchronize_rcu_expedited() is permitted +to impose modest degradation of real-time latency on non-idle online +CPUs. Here, “modest” means roughly the same latency degradation as a +scheduling-clock interrupt. + +There are a number of situations where even +synchronize_rcu_expedited()'s reduced grace-period latency is +unacceptable. In these situations, the asynchronous call_rcu() can +be used in place of synchronize_rcu() as follows: + + :: + + 1 struct foo { + 2 int a; + 3 int b; + 4 struct rcu_head rh; + 5 }; + 6 + 7 static void remove_gp_cb(struct rcu_head *rhp) + 8 { + 9 struct foo *p = container_of(rhp, struct foo, rh); + 10 + 11 kfree(p); + 12 } + 13 + 14 bool remove_gp_asynchronous(void) + 15 { + 16 struct foo *p; + 17 + 18 spin_lock(&gp_lock); + 19 p = rcu_access_pointer(gp); + 20 if (!p) { + 21 spin_unlock(&gp_lock); + 22 return false; + 23 } + 24 rcu_assign_pointer(gp, NULL); + 25 call_rcu(&p->rh, remove_gp_cb); + 26 spin_unlock(&gp_lock); + 27 return true; + 28 } + +A definition of ``struct foo`` is finally needed, and appears on +lines 1-5. The function remove_gp_cb() is passed to call_rcu() +on line 25, and will be invoked after the end of a subsequent grace +period. This gets the same effect as remove_gp_synchronous(), but +without forcing the updater to wait for a grace period to elapse. The +call_rcu() function may be used in a number of situations where +neither synchronize_rcu() nor synchronize_rcu_expedited() would +be legal, including within preempt-disable code, local_bh_disable() +code, interrupt-disable code, and interrupt handlers. However, even +call_rcu() is illegal within NMI handlers and from idle and offline +CPUs. The callback function (remove_gp_cb() in this case) will be +executed within softirq (software interrupt) environment within the +Linux kernel, either within a real softirq handler or under the +protection of local_bh_disable(). In both the Linux kernel and in +userspace, it is bad practice to write an RCU callback function that +takes too long. Long-running operations should be relegated to separate +threads or (in the Linux kernel) workqueues. + ++-----------------------------------------------------------------------+ +| **Quick Quiz**: | ++-----------------------------------------------------------------------+ +| Why does line 19 use rcu_access_pointer()? After all, | +| call_rcu() on line 25 stores into the structure, which would | +| interact badly with concurrent insertions. Doesn't this mean that | +| rcu_dereference() is required? | ++-----------------------------------------------------------------------+ +| **Answer**: | ++-----------------------------------------------------------------------+ +| Presumably the ``->gp_lock`` acquired on line 18 excludes any | +| changes, including any insertions that rcu_dereference() would | +| protect against. Therefore, any insertions will be delayed until | +| after ``->gp_lock`` is released on line 25, which in turn means that | +| rcu_access_pointer() suffices. | ++-----------------------------------------------------------------------+ + +However, all that remove_gp_cb() is doing is invoking kfree() on +the data element. This is a common idiom, and is supported by +kfree_rcu(), which allows “fire and forget” operation as shown +below: + + :: + + 1 struct foo { + 2 int a; + 3 int b; + 4 struct rcu_head rh; + 5 }; + 6 + 7 bool remove_gp_faf(void) + 8 { + 9 struct foo *p; + 10 + 11 spin_lock(&gp_lock); + 12 p = rcu_dereference(gp); + 13 if (!p) { + 14 spin_unlock(&gp_lock); + 15 return false; + 16 } + 17 rcu_assign_pointer(gp, NULL); + 18 kfree_rcu(p, rh); + 19 spin_unlock(&gp_lock); + 20 return true; + 21 } + +Note that remove_gp_faf() simply invokes kfree_rcu() and +proceeds, without any need to pay any further attention to the +subsequent grace period and kfree(). It is permissible to invoke +kfree_rcu() from the same environments as for call_rcu(). +Interestingly enough, DYNIX/ptx had the equivalents of call_rcu() +and kfree_rcu(), but not synchronize_rcu(). This was due to the +fact that RCU was not heavily used within DYNIX/ptx, so the very few +places that needed something like synchronize_rcu() simply +open-coded it. + ++-----------------------------------------------------------------------+ +| **Quick Quiz**: | ++-----------------------------------------------------------------------+ +| Earlier it was claimed that call_rcu() and kfree_rcu() | +| allowed updaters to avoid being blocked by readers. But how can that | +| be correct, given that the invocation of the callback and the freeing | +| of the memory (respectively) must still wait for a grace period to | +| elapse? | ++-----------------------------------------------------------------------+ +| **Answer**: | ++-----------------------------------------------------------------------+ +| We could define things this way, but keep in mind that this sort of | +| definition would say that updates in garbage-collected languages | +| cannot complete until the next time the garbage collector runs, which | +| does not seem at all reasonable. The key point is that in most cases, | +| an updater using either call_rcu() or kfree_rcu() can proceed | +| to the next update as soon as it has invoked call_rcu() or | +| kfree_rcu(), without having to wait for a subsequent grace | +| period. | ++-----------------------------------------------------------------------+ + +But what if the updater must wait for the completion of code to be +executed after the end of the grace period, but has other tasks that can +be carried out in the meantime? The polling-style +get_state_synchronize_rcu() and cond_synchronize_rcu() functions +may be used for this purpose, as shown below: + + :: + + 1 bool remove_gp_poll(void) + 2 { + 3 struct foo *p; + 4 unsigned long s; + 5 + 6 spin_lock(&gp_lock); + 7 p = rcu_access_pointer(gp); + 8 if (!p) { + 9 spin_unlock(&gp_lock); + 10 return false; + 11 } + 12 rcu_assign_pointer(gp, NULL); + 13 spin_unlock(&gp_lock); + 14 s = get_state_synchronize_rcu(); + 15 do_something_while_waiting(); + 16 cond_synchronize_rcu(s); + 17 kfree(p); + 18 return true; + 19 } + +On line 14, get_state_synchronize_rcu() obtains a “cookie” from RCU, +then line 15 carries out other tasks, and finally, line 16 returns +immediately if a grace period has elapsed in the meantime, but otherwise +waits as required. The need for ``get_state_synchronize_rcu`` and +cond_synchronize_rcu() has appeared quite recently, so it is too +early to tell whether they will stand the test of time. + +RCU thus provides a range of tools to allow updaters to strike the +required tradeoff between latency, flexibility and CPU overhead. + +Forward Progress +~~~~~~~~~~~~~~~~ + +In theory, delaying grace-period completion and callback invocation is +harmless. In practice, not only are memory sizes finite but also +callbacks sometimes do wakeups, and sufficiently deferred wakeups can be +difficult to distinguish from system hangs. Therefore, RCU must provide +a number of mechanisms to promote forward progress. + +These mechanisms are not foolproof, nor can they be. For one simple +example, an infinite loop in an RCU read-side critical section must by +definition prevent later grace periods from ever completing. For a more +involved example, consider a 64-CPU system built with +``CONFIG_RCU_NOCB_CPU=y`` and booted with ``rcu_nocbs=1-63``, where +CPUs 1 through 63 spin in tight loops that invoke call_rcu(). Even +if these tight loops also contain calls to cond_resched() (thus +allowing grace periods to complete), CPU 0 simply will not be able to +invoke callbacks as fast as the other 63 CPUs can register them, at +least not until the system runs out of memory. In both of these +examples, the Spiderman principle applies: With great power comes great +responsibility. However, short of this level of abuse, RCU is required +to ensure timely completion of grace periods and timely invocation of +callbacks. + +RCU takes the following steps to encourage timely completion of grace +periods: + +#. If a grace period fails to complete within 100 milliseconds, RCU + causes future invocations of cond_resched() on the holdout CPUs + to provide an RCU quiescent state. RCU also causes those CPUs' + need_resched() invocations to return ``true``, but only after the + corresponding CPU's next scheduling-clock. +#. CPUs mentioned in the ``nohz_full`` kernel boot parameter can run + indefinitely in the kernel without scheduling-clock interrupts, which + defeats the above need_resched() strategem. RCU will therefore + invoke resched_cpu() on any ``nohz_full`` CPUs still holding out + after 109 milliseconds. +#. In kernels built with ``CONFIG_RCU_BOOST=y``, if a given task that + has been preempted within an RCU read-side critical section is + holding out for more than 500 milliseconds, RCU will resort to + priority boosting. +#. If a CPU is still holding out 10 seconds into the grace period, RCU + will invoke resched_cpu() on it regardless of its ``nohz_full`` + state. + +The above values are defaults for systems running with ``HZ=1000``. They +will vary as the value of ``HZ`` varies, and can also be changed using +the relevant Kconfig options and kernel boot parameters. RCU currently +does not do much sanity checking of these parameters, so please use +caution when changing them. Note that these forward-progress measures +are provided only for RCU, not for `SRCU <Sleepable RCU_>`__ or `Tasks +RCU`_. + +RCU takes the following steps in call_rcu() to encourage timely +invocation of callbacks when any given non-\ ``rcu_nocbs`` CPU has +10,000 callbacks, or has 10,000 more callbacks than it had the last time +encouragement was provided: + +#. Starts a grace period, if one is not already in progress. +#. Forces immediate checking for quiescent states, rather than waiting + for three milliseconds to have elapsed since the beginning of the + grace period. +#. Immediately tags the CPU's callbacks with their grace period + completion numbers, rather than waiting for the ``RCU_SOFTIRQ`` + handler to get around to it. +#. Lifts callback-execution batch limits, which speeds up callback + invocation at the expense of degrading realtime response. + +Again, these are default values when running at ``HZ=1000``, and can be +overridden. Again, these forward-progress measures are provided only for +RCU, not for `SRCU <Sleepable RCU_>`__ or `Tasks +RCU`_. Even for RCU, callback-invocation forward +progress for ``rcu_nocbs`` CPUs is much less well-developed, in part +because workloads benefiting from ``rcu_nocbs`` CPUs tend to invoke +call_rcu() relatively infrequently. If workloads emerge that need +both ``rcu_nocbs`` CPUs and high call_rcu() invocation rates, then +additional forward-progress work will be required. + +Composability +~~~~~~~~~~~~~ + +Composability has received much attention in recent years, perhaps in +part due to the collision of multicore hardware with object-oriented +techniques designed in single-threaded environments for single-threaded +use. And in theory, RCU read-side critical sections may be composed, and +in fact may be nested arbitrarily deeply. In practice, as with all +real-world implementations of composable constructs, there are +limitations. + +Implementations of RCU for which rcu_read_lock() and +rcu_read_unlock() generate no code, such as Linux-kernel RCU when +``CONFIG_PREEMPTION=n``, can be nested arbitrarily deeply. After all, there +is no overhead. Except that if all these instances of +rcu_read_lock() and rcu_read_unlock() are visible to the +compiler, compilation will eventually fail due to exhausting memory, +mass storage, or user patience, whichever comes first. If the nesting is +not visible to the compiler, as is the case with mutually recursive +functions each in its own translation unit, stack overflow will result. +If the nesting takes the form of loops, perhaps in the guise of tail +recursion, either the control variable will overflow or (in the Linux +kernel) you will get an RCU CPU stall warning. Nevertheless, this class +of RCU implementations is one of the most composable constructs in +existence. + +RCU implementations that explicitly track nesting depth are limited by +the nesting-depth counter. For example, the Linux kernel's preemptible +RCU limits nesting to ``INT_MAX``. This should suffice for almost all +practical purposes. That said, a consecutive pair of RCU read-side +critical sections between which there is an operation that waits for a +grace period cannot be enclosed in another RCU read-side critical +section. This is because it is not legal to wait for a grace period +within an RCU read-side critical section: To do so would result either +in deadlock or in RCU implicitly splitting the enclosing RCU read-side +critical section, neither of which is conducive to a long-lived and +prosperous kernel. + +It is worth noting that RCU is not alone in limiting composability. For +example, many transactional-memory implementations prohibit composing a +pair of transactions separated by an irrevocable operation (for example, +a network receive operation). For another example, lock-based critical +sections can be composed surprisingly freely, but only if deadlock is +avoided. + +In short, although RCU read-side critical sections are highly +composable, care is required in some situations, just as is the case for +any other composable synchronization mechanism. + +Corner Cases +~~~~~~~~~~~~ + +A given RCU workload might have an endless and intense stream of RCU +read-side critical sections, perhaps even so intense that there was +never a point in time during which there was not at least one RCU +read-side critical section in flight. RCU cannot allow this situation to +block grace periods: As long as all the RCU read-side critical sections +are finite, grace periods must also be finite. + +That said, preemptible RCU implementations could potentially result in +RCU read-side critical sections being preempted for long durations, +which has the effect of creating a long-duration RCU read-side critical +section. This situation can arise only in heavily loaded systems, but +systems using real-time priorities are of course more vulnerable. +Therefore, RCU priority boosting is provided to help deal with this +case. That said, the exact requirements on RCU priority boosting will +likely evolve as more experience accumulates. + +Other workloads might have very high update rates. Although one can +argue that such workloads should instead use something other than RCU, +the fact remains that RCU must handle such workloads gracefully. This +requirement is another factor driving batching of grace periods, but it +is also the driving force behind the checks for large numbers of queued +RCU callbacks in the call_rcu() code path. Finally, high update +rates should not delay RCU read-side critical sections, although some +small read-side delays can occur when using +synchronize_rcu_expedited(), courtesy of this function's use of +smp_call_function_single(). + +Although all three of these corner cases were understood in the early +1990s, a simple user-level test consisting of ``close(open(path))`` in a +tight loop in the early 2000s suddenly provided a much deeper +appreciation of the high-update-rate corner case. This test also +motivated addition of some RCU code to react to high update rates, for +example, if a given CPU finds itself with more than 10,000 RCU callbacks +queued, it will cause RCU to take evasive action by more aggressively +starting grace periods and more aggressively forcing completion of +grace-period processing. This evasive action causes the grace period to +complete more quickly, but at the cost of restricting RCU's batching +optimizations, thus increasing the CPU overhead incurred by that grace +period. + +Software-Engineering Requirements +--------------------------------- + +Between Murphy's Law and “To err is human”, it is necessary to guard +against mishaps and misuse: + +#. It is all too easy to forget to use rcu_read_lock() everywhere + that it is needed, so kernels built with ``CONFIG_PROVE_RCU=y`` will + splat if rcu_dereference() is used outside of an RCU read-side + critical section. Update-side code can use + rcu_dereference_protected(), which takes a `lockdep + expression <https://lwn.net/Articles/371986/>`__ to indicate what is + providing the protection. If the indicated protection is not + provided, a lockdep splat is emitted. + Code shared between readers and updaters can use + rcu_dereference_check(), which also takes a lockdep expression, + and emits a lockdep splat if neither rcu_read_lock() nor the + indicated protection is in place. In addition, + rcu_dereference_raw() is used in those (hopefully rare) cases + where the required protection cannot be easily described. Finally, + rcu_read_lock_held() is provided to allow a function to verify + that it has been invoked within an RCU read-side critical section. I + was made aware of this set of requirements shortly after Thomas + Gleixner audited a number of RCU uses. +#. A given function might wish to check for RCU-related preconditions + upon entry, before using any other RCU API. The + rcu_lockdep_assert() does this job, asserting the expression in + kernels having lockdep enabled and doing nothing otherwise. +#. It is also easy to forget to use rcu_assign_pointer() and + rcu_dereference(), perhaps (incorrectly) substituting a simple + assignment. To catch this sort of error, a given RCU-protected + pointer may be tagged with ``__rcu``, after which sparse will + complain about simple-assignment accesses to that pointer. Arnd + Bergmann made me aware of this requirement, and also supplied the + needed `patch series <https://lwn.net/Articles/376011/>`__. +#. Kernels built with ``CONFIG_DEBUG_OBJECTS_RCU_HEAD=y`` will splat if + a data element is passed to call_rcu() twice in a row, without a + grace period in between. (This error is similar to a double free.) + The corresponding ``rcu_head`` structures that are dynamically + allocated are automatically tracked, but ``rcu_head`` structures + allocated on the stack must be initialized with + init_rcu_head_on_stack() and cleaned up with + destroy_rcu_head_on_stack(). Similarly, statically allocated + non-stack ``rcu_head`` structures must be initialized with + init_rcu_head() and cleaned up with destroy_rcu_head(). + Mathieu Desnoyers made me aware of this requirement, and also + supplied the needed + `patch <https://lore.kernel.org/r/20100319013024.GA28456@Krystal>`__. +#. An infinite loop in an RCU read-side critical section will eventually + trigger an RCU CPU stall warning splat, with the duration of + “eventually” being controlled by the ``RCU_CPU_STALL_TIMEOUT`` + ``Kconfig`` option, or, alternatively, by the + ``rcupdate.rcu_cpu_stall_timeout`` boot/sysfs parameter. However, RCU + is not obligated to produce this splat unless there is a grace period + waiting on that particular RCU read-side critical section. + + Some extreme workloads might intentionally delay RCU grace periods, + and systems running those workloads can be booted with + ``rcupdate.rcu_cpu_stall_suppress`` to suppress the splats. This + kernel parameter may also be set via ``sysfs``. Furthermore, RCU CPU + stall warnings are counter-productive during sysrq dumps and during + panics. RCU therefore supplies the rcu_sysrq_start() and + rcu_sysrq_end() API members to be called before and after long + sysrq dumps. RCU also supplies the rcu_panic() notifier that is + automatically invoked at the beginning of a panic to suppress further + RCU CPU stall warnings. + + This requirement made itself known in the early 1990s, pretty much + the first time that it was necessary to debug a CPU stall. That said, + the initial implementation in DYNIX/ptx was quite generic in + comparison with that of Linux. + +#. Although it would be very good to detect pointers leaking out of RCU + read-side critical sections, there is currently no good way of doing + this. One complication is the need to distinguish between pointers + leaking and pointers that have been handed off from RCU to some other + synchronization mechanism, for example, reference counting. +#. In kernels built with ``CONFIG_RCU_TRACE=y``, RCU-related information + is provided via event tracing. +#. Open-coded use of rcu_assign_pointer() and rcu_dereference() + to create typical linked data structures can be surprisingly + error-prone. Therefore, RCU-protected `linked + lists <https://lwn.net/Articles/609973/#RCU%20List%20APIs>`__ and, + more recently, RCU-protected `hash + tables <https://lwn.net/Articles/612100/>`__ are available. Many + other special-purpose RCU-protected data structures are available in + the Linux kernel and the userspace RCU library. +#. Some linked structures are created at compile time, but still require + ``__rcu`` checking. The RCU_POINTER_INITIALIZER() macro serves + this purpose. +#. It is not necessary to use rcu_assign_pointer() when creating + linked structures that are to be published via a single external + pointer. The RCU_INIT_POINTER() macro is provided for this task. + +This not a hard-and-fast list: RCU's diagnostic capabilities will +continue to be guided by the number and type of usage bugs found in +real-world RCU usage. + +Linux Kernel Complications +-------------------------- + +The Linux kernel provides an interesting environment for all kinds of +software, including RCU. Some of the relevant points of interest are as +follows: + +#. `Configuration`_ +#. `Firmware Interface`_ +#. `Early Boot`_ +#. `Interrupts and NMIs`_ +#. `Loadable Modules`_ +#. `Hotplug CPU`_ +#. `Scheduler and RCU`_ +#. `Tracing and RCU`_ +#. `Accesses to User Memory and RCU`_ +#. `Energy Efficiency`_ +#. `Scheduling-Clock Interrupts and RCU`_ +#. `Memory Efficiency`_ +#. `Performance, Scalability, Response Time, and Reliability`_ + +This list is probably incomplete, but it does give a feel for the most +notable Linux-kernel complications. Each of the following sections +covers one of the above topics. + +Configuration +~~~~~~~~~~~~~ + +RCU's goal is automatic configuration, so that almost nobody needs to +worry about RCU's ``Kconfig`` options. And for almost all users, RCU +does in fact work well “out of the box.” + +However, there are specialized use cases that are handled by kernel boot +parameters and ``Kconfig`` options. Unfortunately, the ``Kconfig`` +system will explicitly ask users about new ``Kconfig`` options, which +requires almost all of them be hidden behind a ``CONFIG_RCU_EXPERT`` +``Kconfig`` option. + +This all should be quite obvious, but the fact remains that Linus +Torvalds recently had to +`remind <https://lore.kernel.org/r/CA+55aFy4wcCwaL4okTs8wXhGZ5h-ibecy_Meg9C4MNQrUnwMcg@mail.gmail.com>`__ +me of this requirement. + +Firmware Interface +~~~~~~~~~~~~~~~~~~ + +In many cases, kernel obtains information about the system from the +firmware, and sometimes things are lost in translation. Or the +translation is accurate, but the original message is bogus. + +For example, some systems' firmware overreports the number of CPUs, +sometimes by a large factor. If RCU naively believed the firmware, as it +used to do, it would create too many per-CPU kthreads. Although the +resulting system will still run correctly, the extra kthreads needlessly +consume memory and can cause confusion when they show up in ``ps`` +listings. + +RCU must therefore wait for a given CPU to actually come online before +it can allow itself to believe that the CPU actually exists. The +resulting “ghost CPUs” (which are never going to come online) cause a +number of `interesting +complications <https://paulmck.livejournal.com/37494.html>`__. + +Early Boot +~~~~~~~~~~ + +The Linux kernel's boot sequence is an interesting process, and RCU is +used early, even before rcu_init() is invoked. In fact, a number of +RCU's primitives can be used as soon as the initial task's +``task_struct`` is available and the boot CPU's per-CPU variables are +set up. The read-side primitives (rcu_read_lock(), +rcu_read_unlock(), rcu_dereference(), and +rcu_access_pointer()) will operate normally very early on, as will +rcu_assign_pointer(). + +Although call_rcu() may be invoked at any time during boot, +callbacks are not guaranteed to be invoked until after all of RCU's +kthreads have been spawned, which occurs at early_initcall() time. +This delay in callback invocation is due to the fact that RCU does not +invoke callbacks until it is fully initialized, and this full +initialization cannot occur until after the scheduler has initialized +itself to the point where RCU can spawn and run its kthreads. In theory, +it would be possible to invoke callbacks earlier, however, this is not a +panacea because there would be severe restrictions on what operations +those callbacks could invoke. + +Perhaps surprisingly, synchronize_rcu() and +synchronize_rcu_expedited(), will operate normally during very early +boot, the reason being that there is only one CPU and preemption is +disabled. This means that the call synchronize_rcu() (or friends) +itself is a quiescent state and thus a grace period, so the early-boot +implementation can be a no-op. + +However, once the scheduler has spawned its first kthread, this early +boot trick fails for synchronize_rcu() (as well as for +synchronize_rcu_expedited()) in ``CONFIG_PREEMPTION=y`` kernels. The +reason is that an RCU read-side critical section might be preempted, +which means that a subsequent synchronize_rcu() really does have to +wait for something, as opposed to simply returning immediately. +Unfortunately, synchronize_rcu() can't do this until all of its +kthreads are spawned, which doesn't happen until some time during +early_initcalls() time. But this is no excuse: RCU is nevertheless +required to correctly handle synchronous grace periods during this time +period. Once all of its kthreads are up and running, RCU starts running +normally. + ++-----------------------------------------------------------------------+ +| **Quick Quiz**: | ++-----------------------------------------------------------------------+ +| How can RCU possibly handle grace periods before all of its kthreads | +| have been spawned??? | ++-----------------------------------------------------------------------+ +| **Answer**: | ++-----------------------------------------------------------------------+ +| Very carefully! | +| During the “dead zone” between the time that the scheduler spawns the | +| first task and the time that all of RCU's kthreads have been spawned, | +| all synchronous grace periods are handled by the expedited | +| grace-period mechanism. At runtime, this expedited mechanism relies | +| on workqueues, but during the dead zone the requesting task itself | +| drives the desired expedited grace period. Because dead-zone | +| execution takes place within task context, everything works. Once the | +| dead zone ends, expedited grace periods go back to using workqueues, | +| as is required to avoid problems that would otherwise occur when a | +| user task received a POSIX signal while driving an expedited grace | +| period. | +| | +| And yes, this does mean that it is unhelpful to send POSIX signals to | +| random tasks between the time that the scheduler spawns its first | +| kthread and the time that RCU's kthreads have all been spawned. If | +| there ever turns out to be a good reason for sending POSIX signals | +| during that time, appropriate adjustments will be made. (If it turns | +| out that POSIX signals are sent during this time for no good reason, | +| other adjustments will be made, appropriate or otherwise.) | ++-----------------------------------------------------------------------+ + +I learned of these boot-time requirements as a result of a series of +system hangs. + +Interrupts and NMIs +~~~~~~~~~~~~~~~~~~~ + +The Linux kernel has interrupts, and RCU read-side critical sections are +legal within interrupt handlers and within interrupt-disabled regions of +code, as are invocations of call_rcu(). + +Some Linux-kernel architectures can enter an interrupt handler from +non-idle process context, and then just never leave it, instead +stealthily transitioning back to process context. This trick is +sometimes used to invoke system calls from inside the kernel. These +“half-interrupts” mean that RCU has to be very careful about how it +counts interrupt nesting levels. I learned of this requirement the hard +way during a rewrite of RCU's dyntick-idle code. + +The Linux kernel has non-maskable interrupts (NMIs), and RCU read-side +critical sections are legal within NMI handlers. Thankfully, RCU +update-side primitives, including call_rcu(), are prohibited within +NMI handlers. + +The name notwithstanding, some Linux-kernel architectures can have +nested NMIs, which RCU must handle correctly. Andy Lutomirski `surprised +me <https://lore.kernel.org/r/CALCETrXLq1y7e_dKFPgou-FKHB6Pu-r8+t-6Ds+8=va7anBWDA@mail.gmail.com>`__ +with this requirement; he also kindly surprised me with `an +algorithm <https://lore.kernel.org/r/CALCETrXSY9JpW3uE6H8WYk81sg56qasA2aqmjMPsq5dOtzso=g@mail.gmail.com>`__ +that meets this requirement. + +Furthermore, NMI handlers can be interrupted by what appear to RCU to be +normal interrupts. One way that this can happen is for code that +directly invokes rcu_irq_enter() and rcu_irq_exit() to be called +from an NMI handler. This astonishing fact of life prompted the current +code structure, which has rcu_irq_enter() invoking +rcu_nmi_enter() and rcu_irq_exit() invoking rcu_nmi_exit(). +And yes, I also learned of this requirement the hard way. + +Loadable Modules +~~~~~~~~~~~~~~~~ + +The Linux kernel has loadable modules, and these modules can also be +unloaded. After a given module has been unloaded, any attempt to call +one of its functions results in a segmentation fault. The module-unload +functions must therefore cancel any delayed calls to loadable-module +functions, for example, any outstanding mod_timer() must be dealt +with via del_timer_sync() or similar. + +Unfortunately, there is no way to cancel an RCU callback; once you +invoke call_rcu(), the callback function is eventually going to be +invoked, unless the system goes down first. Because it is normally +considered socially irresponsible to crash the system in response to a +module unload request, we need some other way to deal with in-flight RCU +callbacks. + +RCU therefore provides rcu_barrier(), which waits until all +in-flight RCU callbacks have been invoked. If a module uses +call_rcu(), its exit function should therefore prevent any future +invocation of call_rcu(), then invoke rcu_barrier(). In theory, +the underlying module-unload code could invoke rcu_barrier() +unconditionally, but in practice this would incur unacceptable +latencies. + +Nikita Danilov noted this requirement for an analogous +filesystem-unmount situation, and Dipankar Sarma incorporated +rcu_barrier() into RCU. The need for rcu_barrier() for module +unloading became apparent later. + +.. important:: + + The rcu_barrier() function is not, repeat, + *not*, obligated to wait for a grace period. It is instead only required + to wait for RCU callbacks that have already been posted. Therefore, if + there are no RCU callbacks posted anywhere in the system, + rcu_barrier() is within its rights to return immediately. Even if + there are callbacks posted, rcu_barrier() does not necessarily need + to wait for a grace period. + ++-----------------------------------------------------------------------+ +| **Quick Quiz**: | ++-----------------------------------------------------------------------+ +| Wait a minute! Each RCU callbacks must wait for a grace period to | +| complete, and rcu_barrier() must wait for each pre-existing | +| callback to be invoked. Doesn't rcu_barrier() therefore need to | +| wait for a full grace period if there is even one callback posted | +| anywhere in the system? | ++-----------------------------------------------------------------------+ +| **Answer**: | ++-----------------------------------------------------------------------+ +| Absolutely not!!! | +| Yes, each RCU callbacks must wait for a grace period to complete, but | +| it might well be partly (or even completely) finished waiting by the | +| time rcu_barrier() is invoked. In that case, rcu_barrier() | +| need only wait for the remaining portion of the grace period to | +| elapse. So even if there are quite a few callbacks posted, | +| rcu_barrier() might well return quite quickly. | +| | +| So if you need to wait for a grace period as well as for all | +| pre-existing callbacks, you will need to invoke both | +| synchronize_rcu() and rcu_barrier(). If latency is a concern, | +| you can always use workqueues to invoke them concurrently. | ++-----------------------------------------------------------------------+ + +Hotplug CPU +~~~~~~~~~~~ + +The Linux kernel supports CPU hotplug, which means that CPUs can come +and go. It is of course illegal to use any RCU API member from an +offline CPU, with the exception of `SRCU <Sleepable RCU_>`__ read-side +critical sections. This requirement was present from day one in +DYNIX/ptx, but on the other hand, the Linux kernel's CPU-hotplug +implementation is “interesting.” + +The Linux-kernel CPU-hotplug implementation has notifiers that are used +to allow the various kernel subsystems (including RCU) to respond +appropriately to a given CPU-hotplug operation. Most RCU operations may +be invoked from CPU-hotplug notifiers, including even synchronous +grace-period operations such as (synchronize_rcu() and +synchronize_rcu_expedited()). However, these synchronous operations +do block and therefore cannot be invoked from notifiers that execute via +stop_machine(), specifically those between the ``CPUHP_AP_OFFLINE`` +and ``CPUHP_AP_ONLINE`` states. + +In addition, all-callback-wait operations such as rcu_barrier() may +not be invoked from any CPU-hotplug notifier. This restriction is due +to the fact that there are phases of CPU-hotplug operations where the +outgoing CPU's callbacks will not be invoked until after the CPU-hotplug +operation ends, which could also result in deadlock. Furthermore, +rcu_barrier() blocks CPU-hotplug operations during its execution, +which results in another type of deadlock when invoked from a CPU-hotplug +notifier. + +Finally, RCU must avoid deadlocks due to interaction between hotplug, +timers and grace period processing. It does so by maintaining its own set +of books that duplicate the centrally maintained ``cpu_online_mask``, +and also by reporting quiescent states explicitly when a CPU goes +offline. This explicit reporting of quiescent states avoids any need +for the force-quiescent-state loop (FQS) to report quiescent states for +offline CPUs. However, as a debugging measure, the FQS loop does splat +if offline CPUs block an RCU grace period for too long. + +An offline CPU's quiescent state will be reported either: + +1. As the CPU goes offline using RCU's hotplug notifier (rcu_report_dead()). +2. When grace period initialization (rcu_gp_init()) detects a + race either with CPU offlining or with a task unblocking on a leaf + ``rcu_node`` structure whose CPUs are all offline. + +The CPU-online path (rcu_cpu_starting()) should never need to report +a quiescent state for an offline CPU. However, as a debugging measure, +it does emit a warning if a quiescent state was not already reported +for that CPU. + +During the checking/modification of RCU's hotplug bookkeeping, the +corresponding CPU's leaf node lock is held. This avoids race conditions +between RCU's hotplug notifier hooks, the grace period initialization +code, and the FQS loop, all of which refer to or modify this bookkeeping. + +Scheduler and RCU +~~~~~~~~~~~~~~~~~ + +RCU makes use of kthreads, and it is necessary to avoid excessive CPU-time +accumulation by these kthreads. This requirement was no surprise, but +RCU's violation of it when running context-switch-heavy workloads when +built with ``CONFIG_NO_HZ_FULL=y`` `did come as a surprise +[PDF] <http://www.rdrop.com/users/paulmck/scalability/paper/BareMetal.2015.01.15b.pdf>`__. +RCU has made good progress towards meeting this requirement, even for +context-switch-heavy ``CONFIG_NO_HZ_FULL=y`` workloads, but there is +room for further improvement. + +There is no longer any prohibition against holding any of +scheduler's runqueue or priority-inheritance spinlocks across an +rcu_read_unlock(), even if interrupts and preemption were enabled +somewhere within the corresponding RCU read-side critical section. +Therefore, it is now perfectly legal to execute rcu_read_lock() +with preemption enabled, acquire one of the scheduler locks, and hold +that lock across the matching rcu_read_unlock(). + +Similarly, the RCU flavor consolidation has removed the need for negative +nesting. The fact that interrupt-disabled regions of code act as RCU +read-side critical sections implicitly avoids earlier issues that used +to result in destructive recursion via interrupt handler's use of RCU. + +Tracing and RCU +~~~~~~~~~~~~~~~ + +It is possible to use tracing on RCU code, but tracing itself uses RCU. +For this reason, rcu_dereference_raw_check() is provided for use +by tracing, which avoids the destructive recursion that could otherwise +ensue. This API is also used by virtualization in some architectures, +where RCU readers execute in environments in which tracing cannot be +used. The tracing folks both located the requirement and provided the +needed fix, so this surprise requirement was relatively painless. + +Accesses to User Memory and RCU +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The kernel needs to access user-space memory, for example, to access data +referenced by system-call parameters. The get_user() macro does this job. + +However, user-space memory might well be paged out, which means that +get_user() might well page-fault and thus block while waiting for the +resulting I/O to complete. It would be a very bad thing for the compiler to +reorder a get_user() invocation into an RCU read-side critical section. + +For example, suppose that the source code looked like this: + + :: + + 1 rcu_read_lock(); + 2 p = rcu_dereference(gp); + 3 v = p->value; + 4 rcu_read_unlock(); + 5 get_user(user_v, user_p); + 6 do_something_with(v, user_v); + +The compiler must not be permitted to transform this source code into +the following: + + :: + + 1 rcu_read_lock(); + 2 p = rcu_dereference(gp); + 3 get_user(user_v, user_p); // BUG: POSSIBLE PAGE FAULT!!! + 4 v = p->value; + 5 rcu_read_unlock(); + 6 do_something_with(v, user_v); + +If the compiler did make this transformation in a ``CONFIG_PREEMPTION=n`` kernel +build, and if get_user() did page fault, the result would be a quiescent +state in the middle of an RCU read-side critical section. This misplaced +quiescent state could result in line 4 being a use-after-free access, +which could be bad for your kernel's actuarial statistics. Similar examples +can be constructed with the call to get_user() preceding the +rcu_read_lock(). + +Unfortunately, get_user() doesn't have any particular ordering properties, +and in some architectures the underlying ``asm`` isn't even marked +``volatile``. And even if it was marked ``volatile``, the above access to +``p->value`` is not volatile, so the compiler would not have any reason to keep +those two accesses in order. + +Therefore, the Linux-kernel definitions of rcu_read_lock() and +rcu_read_unlock() must act as compiler barriers, at least for outermost +instances of rcu_read_lock() and rcu_read_unlock() within a nested set +of RCU read-side critical sections. + +Energy Efficiency +~~~~~~~~~~~~~~~~~ + +Interrupting idle CPUs is considered socially unacceptable, especially +by people with battery-powered embedded systems. RCU therefore conserves +energy by detecting which CPUs are idle, including tracking CPUs that +have been interrupted from idle. This is a large part of the +energy-efficiency requirement, so I learned of this via an irate phone +call. + +Because RCU avoids interrupting idle CPUs, it is illegal to execute an +RCU read-side critical section on an idle CPU. (Kernels built with +``CONFIG_PROVE_RCU=y`` will splat if you try it.) The RCU_NONIDLE() +macro and ``_rcuidle`` event tracing is provided to work around this +restriction. In addition, rcu_is_watching() may be used to test +whether or not it is currently legal to run RCU read-side critical +sections on this CPU. I learned of the need for diagnostics on the one +hand and RCU_NONIDLE() on the other while inspecting idle-loop code. +Steven Rostedt supplied ``_rcuidle`` event tracing, which is used quite +heavily in the idle loop. However, there are some restrictions on the +code placed within RCU_NONIDLE(): + +#. Blocking is prohibited. In practice, this is not a serious + restriction given that idle tasks are prohibited from blocking to + begin with. +#. Although nesting RCU_NONIDLE() is permitted, they cannot nest + indefinitely deeply. However, given that they can be nested on the + order of a million deep, even on 32-bit systems, this should not be a + serious restriction. This nesting limit would probably be reached + long after the compiler OOMed or the stack overflowed. +#. Any code path that enters RCU_NONIDLE() must sequence out of that + same RCU_NONIDLE(). For example, the following is grossly + illegal: + + :: + + 1 RCU_NONIDLE({ + 2 do_something(); + 3 goto bad_idea; /* BUG!!! */ + 4 do_something_else();}); + 5 bad_idea: + + + It is just as illegal to transfer control into the middle of + RCU_NONIDLE()'s argument. Yes, in theory, you could transfer in + as long as you also transferred out, but in practice you could also + expect to get sharply worded review comments. + +It is similarly socially unacceptable to interrupt an ``nohz_full`` CPU +running in userspace. RCU must therefore track ``nohz_full`` userspace +execution. RCU must therefore be able to sample state at two points in +time, and be able to determine whether or not some other CPU spent any +time idle and/or executing in userspace. + +These energy-efficiency requirements have proven quite difficult to +understand and to meet, for example, there have been more than five +clean-sheet rewrites of RCU's energy-efficiency code, the last of which +was finally able to demonstrate `real energy savings running on real +hardware +[PDF] <http://www.rdrop.com/users/paulmck/realtime/paper/AMPenergy.2013.04.19a.pdf>`__. +As noted earlier, I learned of many of these requirements via angry +phone calls: Flaming me on the Linux-kernel mailing list was apparently +not sufficient to fully vent their ire at RCU's energy-efficiency bugs! + +Scheduling-Clock Interrupts and RCU +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The kernel transitions between in-kernel non-idle execution, userspace +execution, and the idle loop. Depending on kernel configuration, RCU +handles these states differently: + ++-----------------+------------------+------------------+-----------------+ +| ``HZ`` Kconfig | In-Kernel | Usermode | Idle | ++=================+==================+==================+=================+ +| ``HZ_PERIODIC`` | Can rely on | Can rely on | Can rely on | +| | scheduling-clock | scheduling-clock | RCU's | +| | interrupt. | interrupt and | dyntick-idle | +| | | its detection | detection. | +| | | of interrupt | | +| | | from usermode. | | ++-----------------+------------------+------------------+-----------------+ +| ``NO_HZ_IDLE`` | Can rely on | Can rely on | Can rely on | +| | scheduling-clock | scheduling-clock | RCU's | +| | interrupt. | interrupt and | dyntick-idle | +| | | its detection | detection. | +| | | of interrupt | | +| | | from usermode. | | ++-----------------+------------------+------------------+-----------------+ +| ``NO_HZ_FULL`` | Can only | Can rely on | Can rely on | +| | sometimes rely | RCU's | RCU's | +| | on | dyntick-idle | dyntick-idle | +| | scheduling-clock | detection. | detection. | +| | interrupt. In | | | +| | other cases, it | | | +| | is necessary to | | | +| | bound kernel | | | +| | execution times | | | +| | and/or use | | | +| | IPIs. | | | ++-----------------+------------------+------------------+-----------------+ + ++-----------------------------------------------------------------------+ +| **Quick Quiz**: | ++-----------------------------------------------------------------------+ +| Why can't ``NO_HZ_FULL`` in-kernel execution rely on the | +| scheduling-clock interrupt, just like ``HZ_PERIODIC`` and | +| ``NO_HZ_IDLE`` do? | ++-----------------------------------------------------------------------+ +| **Answer**: | ++-----------------------------------------------------------------------+ +| Because, as a performance optimization, ``NO_HZ_FULL`` does not | +| necessarily re-enable the scheduling-clock interrupt on entry to each | +| and every system call. | ++-----------------------------------------------------------------------+ + +However, RCU must be reliably informed as to whether any given CPU is +currently in the idle loop, and, for ``NO_HZ_FULL``, also whether that +CPU is executing in usermode, as discussed +`earlier <Energy Efficiency_>`__. It also requires that the +scheduling-clock interrupt be enabled when RCU needs it to be: + +#. If a CPU is either idle or executing in usermode, and RCU believes it + is non-idle, the scheduling-clock tick had better be running. + Otherwise, you will get RCU CPU stall warnings. Or at best, very long + (11-second) grace periods, with a pointless IPI waking the CPU from + time to time. +#. If a CPU is in a portion of the kernel that executes RCU read-side + critical sections, and RCU believes this CPU to be idle, you will get + random memory corruption. **DON'T DO THIS!!!** + This is one reason to test with lockdep, which will complain about + this sort of thing. +#. If a CPU is in a portion of the kernel that is absolutely positively + no-joking guaranteed to never execute any RCU read-side critical + sections, and RCU believes this CPU to be idle, no problem. This + sort of thing is used by some architectures for light-weight + exception handlers, which can then avoid the overhead of + rcu_irq_enter() and rcu_irq_exit() at exception entry and + exit, respectively. Some go further and avoid the entireties of + irq_enter() and irq_exit(). + Just make very sure you are running some of your tests with + ``CONFIG_PROVE_RCU=y``, just in case one of your code paths was in + fact joking about not doing RCU read-side critical sections. +#. If a CPU is executing in the kernel with the scheduling-clock + interrupt disabled and RCU believes this CPU to be non-idle, and if + the CPU goes idle (from an RCU perspective) every few jiffies, no + problem. It is usually OK for there to be the occasional gap between + idle periods of up to a second or so. + If the gap grows too long, you get RCU CPU stall warnings. +#. If a CPU is either idle or executing in usermode, and RCU believes it + to be idle, of course no problem. +#. If a CPU is executing in the kernel, the kernel code path is passing + through quiescent states at a reasonable frequency (preferably about + once per few jiffies, but the occasional excursion to a second or so + is usually OK) and the scheduling-clock interrupt is enabled, of + course no problem. + If the gap between a successive pair of quiescent states grows too + long, you get RCU CPU stall warnings. + ++-----------------------------------------------------------------------+ +| **Quick Quiz**: | ++-----------------------------------------------------------------------+ +| But what if my driver has a hardware interrupt handler that can run | +| for many seconds? I cannot invoke schedule() from an hardware | +| interrupt handler, after all! | ++-----------------------------------------------------------------------+ +| **Answer**: | ++-----------------------------------------------------------------------+ +| One approach is to do ``rcu_irq_exit();rcu_irq_enter();`` every so | +| often. But given that long-running interrupt handlers can cause other | +| problems, not least for response time, shouldn't you work to keep | +| your interrupt handler's runtime within reasonable bounds? | ++-----------------------------------------------------------------------+ + +But as long as RCU is properly informed of kernel state transitions +between in-kernel execution, usermode execution, and idle, and as long +as the scheduling-clock interrupt is enabled when RCU needs it to be, +you can rest assured that the bugs you encounter will be in some other +part of RCU or some other part of the kernel! + +Memory Efficiency +~~~~~~~~~~~~~~~~~ + +Although small-memory non-realtime systems can simply use Tiny RCU, code +size is only one aspect of memory efficiency. Another aspect is the size +of the ``rcu_head`` structure used by call_rcu() and +kfree_rcu(). Although this structure contains nothing more than a +pair of pointers, it does appear in many RCU-protected data structures, +including some that are size critical. The ``page`` structure is a case +in point, as evidenced by the many occurrences of the ``union`` keyword +within that structure. + +This need for memory efficiency is one reason that RCU uses hand-crafted +singly linked lists to track the ``rcu_head`` structures that are +waiting for a grace period to elapse. It is also the reason why +``rcu_head`` structures do not contain debug information, such as fields +tracking the file and line of the call_rcu() or kfree_rcu() that +posted them. Although this information might appear in debug-only kernel +builds at some point, in the meantime, the ``->func`` field will often +provide the needed debug information. + +However, in some cases, the need for memory efficiency leads to even +more extreme measures. Returning to the ``page`` structure, the +``rcu_head`` field shares storage with a great many other structures +that are used at various points in the corresponding page's lifetime. In +order to correctly resolve certain `race +conditions <https://lore.kernel.org/r/1439976106-137226-1-git-send-email-kirill.shutemov@linux.intel.com>`__, +the Linux kernel's memory-management subsystem needs a particular bit to +remain zero during all phases of grace-period processing, and that bit +happens to map to the bottom bit of the ``rcu_head`` structure's +``->next`` field. RCU makes this guarantee as long as call_rcu() is +used to post the callback, as opposed to kfree_rcu() or some future +“lazy” variant of call_rcu() that might one day be created for +energy-efficiency purposes. + +That said, there are limits. RCU requires that the ``rcu_head`` +structure be aligned to a two-byte boundary, and passing a misaligned +``rcu_head`` structure to one of the call_rcu() family of functions +will result in a splat. It is therefore necessary to exercise caution +when packing structures containing fields of type ``rcu_head``. Why not +a four-byte or even eight-byte alignment requirement? Because the m68k +architecture provides only two-byte alignment, and thus acts as +alignment's least common denominator. + +The reason for reserving the bottom bit of pointers to ``rcu_head`` +structures is to leave the door open to “lazy” callbacks whose +invocations can safely be deferred. Deferring invocation could +potentially have energy-efficiency benefits, but only if the rate of +non-lazy callbacks decreases significantly for some important workload. +In the meantime, reserving the bottom bit keeps this option open in case +it one day becomes useful. + +Performance, Scalability, Response Time, and Reliability +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Expanding on the `earlier +discussion <Performance and Scalability_>`__, RCU is used heavily by +hot code paths in performance-critical portions of the Linux kernel's +networking, security, virtualization, and scheduling code paths. RCU +must therefore use efficient implementations, especially in its +read-side primitives. To that end, it would be good if preemptible RCU's +implementation of rcu_read_lock() could be inlined, however, doing +this requires resolving ``#include`` issues with the ``task_struct`` +structure. + +The Linux kernel supports hardware configurations with up to 4096 CPUs, +which means that RCU must be extremely scalable. Algorithms that involve +frequent acquisitions of global locks or frequent atomic operations on +global variables simply cannot be tolerated within the RCU +implementation. RCU therefore makes heavy use of a combining tree based +on the ``rcu_node`` structure. RCU is required to tolerate all CPUs +continuously invoking any combination of RCU's runtime primitives with +minimal per-operation overhead. In fact, in many cases, increasing load +must *decrease* the per-operation overhead, witness the batching +optimizations for synchronize_rcu(), call_rcu(), +synchronize_rcu_expedited(), and rcu_barrier(). As a general +rule, RCU must cheerfully accept whatever the rest of the Linux kernel +decides to throw at it. + +The Linux kernel is used for real-time workloads, especially in +conjunction with the `-rt +patchset <https://wiki.linuxfoundation.org/realtime/>`__. The +real-time-latency response requirements are such that the traditional +approach of disabling preemption across RCU read-side critical sections +is inappropriate. Kernels built with ``CONFIG_PREEMPTION=y`` therefore use +an RCU implementation that allows RCU read-side critical sections to be +preempted. This requirement made its presence known after users made it +clear that an earlier `real-time +patch <https://lwn.net/Articles/107930/>`__ did not meet their needs, in +conjunction with some `RCU +issues <https://lore.kernel.org/r/20050318002026.GA2693@us.ibm.com>`__ +encountered by a very early version of the -rt patchset. + +In addition, RCU must make do with a sub-100-microsecond real-time +latency budget. In fact, on smaller systems with the -rt patchset, the +Linux kernel provides sub-20-microsecond real-time latencies for the +whole kernel, including RCU. RCU's scalability and latency must +therefore be sufficient for these sorts of configurations. To my +surprise, the sub-100-microsecond real-time latency budget `applies to +even the largest systems +[PDF] <http://www.rdrop.com/users/paulmck/realtime/paper/bigrt.2013.01.31a.LCA.pdf>`__, +up to and including systems with 4096 CPUs. This real-time requirement +motivated the grace-period kthread, which also simplified handling of a +number of race conditions. + +RCU must avoid degrading real-time response for CPU-bound threads, +whether executing in usermode (which is one use case for +``CONFIG_NO_HZ_FULL=y``) or in the kernel. That said, CPU-bound loops in +the kernel must execute cond_resched() at least once per few tens of +milliseconds in order to avoid receiving an IPI from RCU. + +Finally, RCU's status as a synchronization primitive means that any RCU +failure can result in arbitrary memory corruption that can be extremely +difficult to debug. This means that RCU must be extremely reliable, +which in practice also means that RCU must have an aggressive +stress-test suite. This stress-test suite is called ``rcutorture``. + +Although the need for ``rcutorture`` was no surprise, the current +immense popularity of the Linux kernel is posing interesting—and perhaps +unprecedented—validation challenges. To see this, keep in mind that +there are well over one billion instances of the Linux kernel running +today, given Android smartphones, Linux-powered televisions, and +servers. This number can be expected to increase sharply with the advent +of the celebrated Internet of Things. + +Suppose that RCU contains a race condition that manifests on average +once per million years of runtime. This bug will be occurring about +three times per *day* across the installed base. RCU could simply hide +behind hardware error rates, given that no one should really expect +their smartphone to last for a million years. However, anyone taking too +much comfort from this thought should consider the fact that in most +jurisdictions, a successful multi-year test of a given mechanism, which +might include a Linux kernel, suffices for a number of types of +safety-critical certifications. In fact, rumor has it that the Linux +kernel is already being used in production for safety-critical +applications. I don't know about you, but I would feel quite bad if a +bug in RCU killed someone. Which might explain my recent focus on +validation and verification. + +Other RCU Flavors +----------------- + +One of the more surprising things about RCU is that there are now no +fewer than five *flavors*, or API families. In addition, the primary +flavor that has been the sole focus up to this point has two different +implementations, non-preemptible and preemptible. The other four flavors +are listed below, with requirements for each described in a separate +section. + +#. `Bottom-Half Flavor (Historical)`_ +#. `Sched Flavor (Historical)`_ +#. `Sleepable RCU`_ +#. `Tasks RCU`_ + +Bottom-Half Flavor (Historical) +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The RCU-bh flavor of RCU has since been expressed in terms of the other +RCU flavors as part of a consolidation of the three flavors into a +single flavor. The read-side API remains, and continues to disable +softirq and to be accounted for by lockdep. Much of the material in this +section is therefore strictly historical in nature. + +The softirq-disable (AKA “bottom-half”, hence the “_bh” abbreviations) +flavor of RCU, or *RCU-bh*, was developed by Dipankar Sarma to provide a +flavor of RCU that could withstand the network-based denial-of-service +attacks researched by Robert Olsson. These attacks placed so much +networking load on the system that some of the CPUs never exited softirq +execution, which in turn prevented those CPUs from ever executing a +context switch, which, in the RCU implementation of that time, prevented +grace periods from ever ending. The result was an out-of-memory +condition and a system hang. + +The solution was the creation of RCU-bh, which does +local_bh_disable() across its read-side critical sections, and which +uses the transition from one type of softirq processing to another as a +quiescent state in addition to context switch, idle, user mode, and +offline. This means that RCU-bh grace periods can complete even when +some of the CPUs execute in softirq indefinitely, thus allowing +algorithms based on RCU-bh to withstand network-based denial-of-service +attacks. + +Because rcu_read_lock_bh() and rcu_read_unlock_bh() disable and +re-enable softirq handlers, any attempt to start a softirq handlers +during the RCU-bh read-side critical section will be deferred. In this +case, rcu_read_unlock_bh() will invoke softirq processing, which can +take considerable time. One can of course argue that this softirq +overhead should be associated with the code following the RCU-bh +read-side critical section rather than rcu_read_unlock_bh(), but the +fact is that most profiling tools cannot be expected to make this sort +of fine distinction. For example, suppose that a three-millisecond-long +RCU-bh read-side critical section executes during a time of heavy +networking load. There will very likely be an attempt to invoke at least +one softirq handler during that three milliseconds, but any such +invocation will be delayed until the time of the +rcu_read_unlock_bh(). This can of course make it appear at first +glance as if rcu_read_unlock_bh() was executing very slowly. + +The `RCU-bh +API <https://lwn.net/Articles/609973/#RCU%20Per-Flavor%20API%20Table>`__ +includes rcu_read_lock_bh(), rcu_read_unlock_bh(), rcu_dereference_bh(), +rcu_dereference_bh_check(), and rcu_read_lock_bh_held(). However, the +old RCU-bh update-side APIs are now gone, replaced by synchronize_rcu(), +synchronize_rcu_expedited(), call_rcu(), and rcu_barrier(). In addition, +anything that disables bottom halves also marks an RCU-bh read-side +critical section, including local_bh_disable() and local_bh_enable(), +local_irq_save() and local_irq_restore(), and so on. + +Sched Flavor (Historical) +~~~~~~~~~~~~~~~~~~~~~~~~~ + +The RCU-sched flavor of RCU has since been expressed in terms of the +other RCU flavors as part of a consolidation of the three flavors into a +single flavor. The read-side API remains, and continues to disable +preemption and to be accounted for by lockdep. Much of the material in +this section is therefore strictly historical in nature. + +Before preemptible RCU, waiting for an RCU grace period had the side +effect of also waiting for all pre-existing interrupt and NMI handlers. +However, there are legitimate preemptible-RCU implementations that do +not have this property, given that any point in the code outside of an +RCU read-side critical section can be a quiescent state. Therefore, +*RCU-sched* was created, which follows “classic” RCU in that an +RCU-sched grace period waits for pre-existing interrupt and NMI +handlers. In kernels built with ``CONFIG_PREEMPTION=n``, the RCU and +RCU-sched APIs have identical implementations, while kernels built with +``CONFIG_PREEMPTION=y`` provide a separate implementation for each. + +Note well that in ``CONFIG_PREEMPTION=y`` kernels, +rcu_read_lock_sched() and rcu_read_unlock_sched() disable and +re-enable preemption, respectively. This means that if there was a +preemption attempt during the RCU-sched read-side critical section, +rcu_read_unlock_sched() will enter the scheduler, with all the +latency and overhead entailed. Just as with rcu_read_unlock_bh(), +this can make it look as if rcu_read_unlock_sched() was executing +very slowly. However, the highest-priority task won't be preempted, so +that task will enjoy low-overhead rcu_read_unlock_sched() +invocations. + +The `RCU-sched +API <https://lwn.net/Articles/609973/#RCU%20Per-Flavor%20API%20Table>`__ +includes rcu_read_lock_sched(), rcu_read_unlock_sched(), +rcu_read_lock_sched_notrace(), rcu_read_unlock_sched_notrace(), +rcu_dereference_sched(), rcu_dereference_sched_check(), and +rcu_read_lock_sched_held(). However, the old RCU-sched update-side APIs +are now gone, replaced by synchronize_rcu(), synchronize_rcu_expedited(), +call_rcu(), and rcu_barrier(). In addition, anything that disables +preemption also marks an RCU-sched read-side critical section, +including preempt_disable() and preempt_enable(), local_irq_save() +and local_irq_restore(), and so on. + +Sleepable RCU +~~~~~~~~~~~~~ + +For well over a decade, someone saying “I need to block within an RCU +read-side critical section” was a reliable indication that this someone +did not understand RCU. After all, if you are always blocking in an RCU +read-side critical section, you can probably afford to use a +higher-overhead synchronization mechanism. However, that changed with +the advent of the Linux kernel's notifiers, whose RCU read-side critical +sections almost never sleep, but sometimes need to. This resulted in the +introduction of `sleepable RCU <https://lwn.net/Articles/202847/>`__, or +*SRCU*. + +SRCU allows different domains to be defined, with each such domain +defined by an instance of an ``srcu_struct`` structure. A pointer to +this structure must be passed in to each SRCU function, for example, +``synchronize_srcu(&ss)``, where ``ss`` is the ``srcu_struct`` +structure. The key benefit of these domains is that a slow SRCU reader +in one domain does not delay an SRCU grace period in some other domain. +That said, one consequence of these domains is that read-side code must +pass a “cookie” from srcu_read_lock() to srcu_read_unlock(), for +example, as follows: + + :: + + 1 int idx; + 2 + 3 idx = srcu_read_lock(&ss); + 4 do_something(); + 5 srcu_read_unlock(&ss, idx); + +As noted above, it is legal to block within SRCU read-side critical +sections, however, with great power comes great responsibility. If you +block forever in one of a given domain's SRCU read-side critical +sections, then that domain's grace periods will also be blocked forever. +Of course, one good way to block forever is to deadlock, which can +happen if any operation in a given domain's SRCU read-side critical +section can wait, either directly or indirectly, for that domain's grace +period to elapse. For example, this results in a self-deadlock: + + :: + + 1 int idx; + 2 + 3 idx = srcu_read_lock(&ss); + 4 do_something(); + 5 synchronize_srcu(&ss); + 6 srcu_read_unlock(&ss, idx); + +However, if line 5 acquired a mutex that was held across a +synchronize_srcu() for domain ``ss``, deadlock would still be +possible. Furthermore, if line 5 acquired a mutex that was held across a +synchronize_srcu() for some other domain ``ss1``, and if an +``ss1``-domain SRCU read-side critical section acquired another mutex +that was held across as ``ss``-domain synchronize_srcu(), deadlock +would again be possible. Such a deadlock cycle could extend across an +arbitrarily large number of different SRCU domains. Again, with great +power comes great responsibility. + +Unlike the other RCU flavors, SRCU read-side critical sections can run +on idle and even offline CPUs. This ability requires that +srcu_read_lock() and srcu_read_unlock() contain memory barriers, +which means that SRCU readers will run a bit slower than would RCU +readers. It also motivates the smp_mb__after_srcu_read_unlock() API, +which, in combination with srcu_read_unlock(), guarantees a full +memory barrier. + +Also unlike other RCU flavors, synchronize_srcu() may **not** be +invoked from CPU-hotplug notifiers, due to the fact that SRCU grace +periods make use of timers and the possibility of timers being +temporarily “stranded” on the outgoing CPU. This stranding of timers +means that timers posted to the outgoing CPU will not fire until late in +the CPU-hotplug process. The problem is that if a notifier is waiting on +an SRCU grace period, that grace period is waiting on a timer, and that +timer is stranded on the outgoing CPU, then the notifier will never be +awakened, in other words, deadlock has occurred. This same situation of +course also prohibits srcu_barrier() from being invoked from +CPU-hotplug notifiers. + +SRCU also differs from other RCU flavors in that SRCU's expedited and +non-expedited grace periods are implemented by the same mechanism. This +means that in the current SRCU implementation, expediting a future grace +period has the side effect of expediting all prior grace periods that +have not yet completed. (But please note that this is a property of the +current implementation, not necessarily of future implementations.) In +addition, if SRCU has been idle for longer than the interval specified +by the ``srcutree.exp_holdoff`` kernel boot parameter (25 microseconds +by default), and if a synchronize_srcu() invocation ends this idle +period, that invocation will be automatically expedited. + +As of v4.12, SRCU's callbacks are maintained per-CPU, eliminating a +locking bottleneck present in prior kernel versions. Although this will +allow users to put much heavier stress on call_srcu(), it is +important to note that SRCU does not yet take any special steps to deal +with callback flooding. So if you are posting (say) 10,000 SRCU +callbacks per second per CPU, you are probably totally OK, but if you +intend to post (say) 1,000,000 SRCU callbacks per second per CPU, please +run some tests first. SRCU just might need a few adjustment to deal with +that sort of load. Of course, your mileage may vary based on the speed +of your CPUs and the size of your memory. + +The `SRCU +API <https://lwn.net/Articles/609973/#RCU%20Per-Flavor%20API%20Table>`__ +includes srcu_read_lock(), srcu_read_unlock(), +srcu_dereference(), srcu_dereference_check(), +synchronize_srcu(), synchronize_srcu_expedited(), +call_srcu(), srcu_barrier(), and srcu_read_lock_held(). It +also includes DEFINE_SRCU(), DEFINE_STATIC_SRCU(), and +init_srcu_struct() APIs for defining and initializing +``srcu_struct`` structures. + +More recently, the SRCU API has added polling interfaces: + +#. start_poll_synchronize_srcu() returns a cookie identifying + the completion of a future SRCU grace period and ensures + that this grace period will be started. +#. poll_state_synchronize_srcu() returns ``true`` iff the + specified cookie corresponds to an already-completed + SRCU grace period. +#. get_state_synchronize_srcu() returns a cookie just like + start_poll_synchronize_srcu() does, but differs in that + it does nothing to ensure that any future SRCU grace period + will be started. + +These functions are used to avoid unnecessary SRCU grace periods in +certain types of buffer-cache algorithms having multi-stage age-out +mechanisms. The idea is that by the time the block has aged completely +from the cache, an SRCU grace period will be very likely to have elapsed. + +Tasks RCU +~~~~~~~~~ + +Some forms of tracing use “trampolines” to handle the binary rewriting +required to install different types of probes. It would be good to be +able to free old trampolines, which sounds like a job for some form of +RCU. However, because it is necessary to be able to install a trace +anywhere in the code, it is not possible to use read-side markers such +as rcu_read_lock() and rcu_read_unlock(). In addition, it does +not work to have these markers in the trampoline itself, because there +would need to be instructions following rcu_read_unlock(). Although +synchronize_rcu() would guarantee that execution reached the +rcu_read_unlock(), it would not be able to guarantee that execution +had completely left the trampoline. Worse yet, in some situations +the trampoline's protection must extend a few instructions *prior* to +execution reaching the trampoline. For example, these few instructions +might calculate the address of the trampoline, so that entering the +trampoline would be pre-ordained a surprisingly long time before execution +actually reached the trampoline itself. + +The solution, in the form of `Tasks +RCU <https://lwn.net/Articles/607117/>`__, is to have implicit read-side +critical sections that are delimited by voluntary context switches, that +is, calls to schedule(), cond_resched(), and +synchronize_rcu_tasks(). In addition, transitions to and from +userspace execution also delimit tasks-RCU read-side critical sections. + +The tasks-RCU API is quite compact, consisting only of +call_rcu_tasks(), synchronize_rcu_tasks(), and +rcu_barrier_tasks(). In ``CONFIG_PREEMPTION=n`` kernels, trampolines +cannot be preempted, so these APIs map to call_rcu(), +synchronize_rcu(), and rcu_barrier(), respectively. In +``CONFIG_PREEMPTION=y`` kernels, trampolines can be preempted, and these +three APIs are therefore implemented by separate functions that check +for voluntary context switches. + +Tasks Rude RCU +~~~~~~~~~~~~~~ + +Some forms of tracing need to wait for all preemption-disabled regions +of code running on any online CPU, including those executed when RCU is +not watching. This means that synchronize_rcu() is insufficient, and +Tasks Rude RCU must be used instead. This flavor of RCU does its work by +forcing a workqueue to be scheduled on each online CPU, hence the "Rude" +moniker. And this operation is considered to be quite rude by real-time +workloads that don't want their ``nohz_full`` CPUs receiving IPIs and +by battery-powered systems that don't want their idle CPUs to be awakened. + +The tasks-rude-RCU API is also reader-marking-free and thus quite compact, +consisting of call_rcu_tasks_rude(), synchronize_rcu_tasks_rude(), +and rcu_barrier_tasks_rude(). + +Tasks Trace RCU +~~~~~~~~~~~~~~~ + +Some forms of tracing need to sleep in readers, but cannot tolerate +SRCU's read-side overhead, which includes a full memory barrier in both +srcu_read_lock() and srcu_read_unlock(). This need is handled by a +Tasks Trace RCU that uses scheduler locking and IPIs to synchronize with +readers. Real-time systems that cannot tolerate IPIs may build their +kernels with ``CONFIG_TASKS_TRACE_RCU_READ_MB=y``, which avoids the IPIs at +the expense of adding full memory barriers to the read-side primitives. + +The tasks-trace-RCU API is also reasonably compact, +consisting of rcu_read_lock_trace(), rcu_read_unlock_trace(), +rcu_read_lock_trace_held(), call_rcu_tasks_trace(), +synchronize_rcu_tasks_trace(), and rcu_barrier_tasks_trace(). + +Possible Future Changes +----------------------- + +One of the tricks that RCU uses to attain update-side scalability is to +increase grace-period latency with increasing numbers of CPUs. If this +becomes a serious problem, it will be necessary to rework the +grace-period state machine so as to avoid the need for the additional +latency. + +RCU disables CPU hotplug in a few places, perhaps most notably in the +rcu_barrier() operations. If there is a strong reason to use +rcu_barrier() in CPU-hotplug notifiers, it will be necessary to +avoid disabling CPU hotplug. This would introduce some complexity, so +there had better be a *very* good reason. + +The tradeoff between grace-period latency on the one hand and +interruptions of other CPUs on the other hand may need to be +re-examined. The desire is of course for zero grace-period latency as +well as zero interprocessor interrupts undertaken during an expedited +grace period operation. While this ideal is unlikely to be achievable, +it is quite possible that further improvements can be made. + +The multiprocessor implementations of RCU use a combining tree that +groups CPUs so as to reduce lock contention and increase cache locality. +However, this combining tree does not spread its memory across NUMA +nodes nor does it align the CPU groups with hardware features such as +sockets or cores. Such spreading and alignment is currently believed to +be unnecessary because the hotpath read-side primitives do not access +the combining tree, nor does call_rcu() in the common case. If you +believe that your architecture needs such spreading and alignment, then +your architecture should also benefit from the +``rcutree.rcu_fanout_leaf`` boot parameter, which can be set to the +number of CPUs in a socket, NUMA node, or whatever. If the number of +CPUs is too large, use a fraction of the number of CPUs. If the number +of CPUs is a large prime number, well, that certainly is an +“interesting” architectural choice! More flexible arrangements might be +considered, but only if ``rcutree.rcu_fanout_leaf`` has proven +inadequate, and only if the inadequacy has been demonstrated by a +carefully run and realistic system-level workload. + +Please note that arrangements that require RCU to remap CPU numbers will +require extremely good demonstration of need and full exploration of +alternatives. + +RCU's various kthreads are reasonably recent additions. It is quite +likely that adjustments will be required to more gracefully handle +extreme loads. It might also be necessary to be able to relate CPU +utilization by RCU's kthreads and softirq handlers to the code that +instigated this CPU utilization. For example, RCU callback overhead +might be charged back to the originating call_rcu() instance, though +probably not in production kernels. + +Additional work may be required to provide reasonable forward-progress +guarantees under heavy load for grace periods and for callback +invocation. + +Summary +------- + +This document has presented more than two decade's worth of RCU +requirements. Given that the requirements keep changing, this will not +be the last word on this subject, but at least it serves to get an +important subset of the requirements set forth. + +Acknowledgments +--------------- + +I am grateful to Steven Rostedt, Lai Jiangshan, Ingo Molnar, Oleg +Nesterov, Borislav Petkov, Peter Zijlstra, Boqun Feng, and Andy +Lutomirski for their help in rendering this article human readable, and +to Michelle Rankin for her support of this effort. Other contributions +are acknowledged in the Linux kernel's git archive. |