== Traffic Manager \(TM)

The TM subsystem is a general packet scheduling system that accepts
packets from input queues and applies strict priority scheduling, weighted fair
queueing and/or bandwidth controls to decide which input packet should be
chosen as the next output packet and when that output packet can be sent
onwards.

A given platform supporting this TM API could support one or more pure hardware
based packet scheduling systems, one or more pure software based systems, or one
or more hybrid systems where, because of hardware constraints, some of the
packet scheduling is done in hardware and some is done in software.  There may
also be additional APIs beyond those described here for:

- controlling advanced capabilities supported by specific hardware, software
  or hybrid subsystems
- dealing with constraints and limitations of specific implementations.

The intention here is to be the simplest API that covers the vast majority of
packet scheduling requirements.

Often a TM subsystem's output(s) will be directly connected to a device's
physical (or virtual) output interfaces/links, in which case such a system is
sometimes called an Egress Packet Scheduler, an Output Link Shaper, etc.
While the TM subsystems configured by this API can be used in such a way, this
API equally well supports connecting a TM subsystem's outputs to other TM
subsystem input queues, to general software queues, or to some combination of
these three cases.

=== TM Algorithms

The packet scheduling/dropping techniques that can be applied to input
traffic include any mixture of the following:

- Strict Priority scheduling.
- Weighted Fair Queueing scheduling (WFQ).
- Bandwidth Shaping.
- Weighted Random Early Discard (WRED).

Note that Bandwidth Shaping is the only feature that can cause packets to be
"delayed", and Weighted Random Early Discard is the only feature (other than
input queues becoming full) that can cause packets to be dropped.

==== Strict Priority Scheduling

Strict Priority Scheduling (or just priority for short) is a technique where
input queues, and the packets from them, are assigned a priority value in the
range 0 .. ODP_TM_MAX_PRIORITIES - 1.  At all times packets with the smallest
priority value will be chosen ahead of packets with a numerically larger
priority value.  This is called strict priority scheduling because the
algorithm strictly enforces the scheduling of higher priority packets over
lower priority packets.
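
For illustration only (this is not part of the ODP API), the selection rule can
be sketched as follows:

[source,c]
----
/* Illustrative sketch only.  Chooses the next priority level to service:
 * the smallest numeric value (highest priority) that currently has a
 * packet available always wins. */
int next_priority_to_service(const int pkts_avail[], int num_priorities)
{
	int prio;

	for (prio = 0; prio < num_priorities; prio++)
		if (pkts_avail[prio] > 0)
			return prio;  /* strictly preferred over larger values */

	return -1;  /* nothing to send */
}
----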

==== Bandwidth Shaping

Bandwidth Shaping (or often just Shaping) is the term used here for the idea of
controlling packet rates using single rate and/or dual rate token bucket
algorithms.  For single rate shaping a rate (the commit rate) and a "burst
size" (the maximum commit count) are configured.  Then an internal signed
integer counter called the _commitCnt_ is maintained such that if the _commitCnt_
is positive then packets are eligible to be sent.  When such a packet is
actually sent, the _commitCnt_ is decremented (usually by the packet's length,
but one could instead decrement by 1 for each packet).  The _commitCnt_ is then
incremented periodically based upon the configured rate, so that this technique
limits the traffic to the commit rate over the long term, while allowing some
ability to exceed this rate for a very short time (based on the burst size) in
order to catch up if the traffic input temporarily drops below
the commit rate.
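
The following minimal sketch (illustrative only - neither the ODP API nor how
any particular implementation works) shows the single rate bookkeeping
described above:

[source,c]
----
#include <stdint.h>

/* Illustrative sketch only - not the ODP API. */
typedef struct {
	int64_t commit_cnt;      /* signed token counter (e.g. in bytes)   */
	int64_t max_commit_cnt;  /* "burst size" - upper bound on the count */
	int64_t commit_rate;     /* tokens added per elapsed time unit      */
} shaper_state_t;

/* Called periodically: replenish tokens, never exceeding the burst size. */
void shaper_update(shaper_state_t *s, int64_t elapsed_time_units)
{
	s->commit_cnt += s->commit_rate * elapsed_time_units;
	if (s->commit_cnt > s->max_commit_cnt)
		s->commit_cnt = s->max_commit_cnt;
}

/* A packet is eligible while commit_cnt is positive; sending it
 * decrements the counter by the packet length. */
int shaper_try_send(shaper_state_t *s, int64_t pkt_len)
{
	if (s->commit_cnt <= 0)
		return 0;         /* not eligible - the packet is delayed */
	s->commit_cnt -= pkt_len;
	return 1;
}
----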

Dual Rate Shaping is designed to allow certain traffic flows to fairly send
more than their assigned commit rate when the scheduler has excess capacity.
The idea is that it may be better to allow some types of traffic to send more
than their committed bandwidth rather than letting the TM outputs be idle.
Configuring Dual Rate Shaping additionally requires a peak rate and a peak
burst size.  The peak rate must be greater than the related commit rate, but
the burst sizes have no similar constraint.  Also, for every input priority
that has Dual Rate shaping enabled, there needs to be an additional equal or
lower priority (equal or higher numeric priority value) assigned.  Then if the
traffic exceeds its commit rate but not its peak rate, the "excess" traffic
will be sent at the lower priority level - which by the strict priority
algorithm should cause no degradation of the higher priority traffic, while
allowing for less idle outputs.
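
A similarly illustrative sketch of the dual rate decision, assuming a second
counter (the peak counter) maintained from the peak rate and peak burst size in
the same way as the _commitCnt_ above:

[source,c]
----
#include <stdint.h>

/* Illustrative sketch only - not the ODP API. */
typedef enum {
	NOT_ELIGIBLE,             /* delayed by the shaper                  */
	SEND_AT_CONFIGURED_PRIO,  /* traffic is within the commit rate      */
	SEND_AT_LOWER_PRIO        /* "excess" traffic is demoted in priority */
} shaper_verdict_t;

shaper_verdict_t dual_rate_verdict(int64_t commit_cnt, int64_t peak_cnt)
{
	if (commit_cnt > 0)
		return SEND_AT_CONFIGURED_PRIO;
	if (peak_cnt > 0)
		return SEND_AT_LOWER_PRIO;
	return NOT_ELIGIBLE;
}
----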

==== Weighted Fair Queuing

Weighted Fair Queuing (WFQ) is used to arbitrate among multiple input
packets with the same priority.  Each input can be assigned a weight in the
range MIN_WFQ_WEIGHT..MAX_WFQ_WEIGHT (nominally 1..255) that affects the way
the algorithm chooses the next packet.  If all of the weights are equal AND all
of the input packets are the same length, then the algorithm is equivalent to
round robin scheduling.  If all of the weights are equal but the packets have
different lengths, then the WFQ algorithm will attempt to choose packets such
that the inputs each get a fair share of the bandwidth - in other words it
implements a weighted round robin algorithm where the weighting is based on
frame length.

When the input weights are not all equal and the input packet lengths vary then
the WFQ algorithm will schedule packets such that the packet with the lowest
"Virtual Finish Time" is chosen first.  An input packet's Virtual Finish Time
is roughly calculated based on the WFQ object's base Virtual Finish Time when
the packet becomes the first packet in its queue plus its frame length divided
by its weight.
----
virtualFinishTime = wfqVirtualTimeBase + (pktLength / wfqWeight)
----
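
For illustration only (hypothetical types, not the ODP API), the WFQ scheduler
can then be viewed as servicing, among all fan-ins that currently have a head
packet, the one with the smallest virtual finish time:

[source,c]
----
#include <stdint.h>

/* Illustrative sketch only.  Each fan-in exposes the virtual finish time
 * of its current head packet; the WFQ scheduler services the smallest. */
typedef struct {
	uint64_t virtual_finish_time;
	int      has_packet;
} wfq_fanin_t;

int wfq_choose(const wfq_fanin_t fanin[], int num_fanins)
{
	int i, best = -1;

	for (i = 0; i < num_fanins; i++) {
		if (!fanin[i].has_packet)
			continue;
		if (best < 0 || fanin[i].virtual_finish_time <
		    fanin[best].virtual_finish_time)
			best = i;
	}
	return best;  /* index of the fan-in to service next, or -1 */
}
----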

In a system running at full capacity with no bandwidth limits, over the long
term each input fan-in's average transmit rate will be the fraction of the
output bandwidth given by its weight divided by the sum of all of the WFQ
fan-in weights.  Hence larger WFQ weights result in better "service" for a
given fan-in.

[source,c]
----
totalWfqWeight = 0;
for (each fan-in entity - fanIn - feeding this WFQ scheduler)
	totalWfqWeight += fanIn->wfqWeight;

fanIn->avgTransmitRate = (avgOutputRate * fanIn->wfqWeight) / totalWfqWeight;
----

==== Weighted Random Early Discard

The Weighted Random Early Discard (WRED) algorithm deals with the situation
where an input packet rate exceeds some output rate (including the case where
Bandwidth Shaping limits some output rates).  Without WRED enabled and
configured, the TM system will just implement a tail dropping scheme whereby
whichever packet is unlucky enough to arrive when a TM input queue is full
will be discarded regardless of priority or any other consideration. WRED
allows one to configure the system to use a better/fairer algorithm than simple
tail dropping.  It works by measuring the "fullness" of various packet queues
and converting this percentage into a probability of random packet dropping
with the help of some configurable parameters. Then a random number is picked
and together with the drop probability, a decision is made to accept the packet
or drop it. A basic parameterization of WRED requires three parameters:

- the maximum queue level (which could be either a maximum number of
     packets or a maximum amount of memory (i.e. bytes/buffers) used),
- a starting threshold - which is a number in the range 0..100
     representing a percentage of the maximum queue level at which the
     drop probability becomes non-zero,
- a drop probability - which is a number in the range 0..100
     representing a probability (0 means no drop and 100 means
     certain drop) - which is used when the queue is near 100% full.
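
The following illustrative sketch (not the ODP API) shows one way these three
parameters could combine into a drop decision for an arriving packet; it
assumes the starting threshold is below 100:

[source,c]
----
#include <stdint.h>
#include <stdlib.h>

/* Illustrative sketch only - not the ODP API.  Combines the maximum queue
 * level, starting threshold (percent) and drop probability (percent) into
 * a drop/no-drop decision.  Assumes start_pct < 100. */
int wred_should_drop(uint64_t cur_level, uint64_t max_level,
		     uint32_t start_pct, uint32_t drop_prob_pct)
{
	uint32_t fullness_pct = (uint32_t)((cur_level * 100) / max_level);
	uint32_t drop_prob;

	if (fullness_pct < start_pct)
		return 0;  /* queue not full enough - always accept */

	/* Scale linearly from 0 at the starting threshold up to
	 * drop_prob_pct as the queue approaches 100% full. */
	drop_prob = (drop_prob_pct * (fullness_pct - start_pct)) /
		    (100 - start_pct);

	return (uint32_t)(rand() % 100) < drop_prob;  /* random decision */
}
----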

Note that all packet drops for a TM system only occur when a new packet arrives
at a given TM system input queue.  At that time either the WRED algorithm, if
enabled for this input queue, or the "input queue full" tail drop algorithm
will make a drop/no drop decision.  After this point, any packet not dropped
will at some point be sent out a TM output - assuming that the topology is
fully connected and enabled.

=== Hierarchical Scheduling and tm_nodes

This API supports the ability to do Hierarchical Scheduling whereby the
final scheduling decision is controlled by equal priority schedulers,
strict priority multiplexers, bandwidth shapers - at multiple levels - all
forming a tree rooted at a single egress object.  In other words, all
tm_queues and tm_nodes have the property that their logical "output" feeds
into one fan-in of a subsequent tm_node or egress object - forming a proper
tree.

.Hierarchical Scheduling
image::tm_hierarchy.svg[align="center"]

Multi-level/hierarchical scheduling adds both great control and significant
complexity.  Logically, despite the implication of the tm_node tree diagrams,
there are no queues between the levels of hierarchy.  Instead all packets are
held in their input queue, until such time that the totality of all of the
tm_nodes in the single path from input queue to output object agrees that this
packet should be the next to be chosen to leave the TM system through the
output object "portal".  Hence what flows from level to level is the "local
choice" of what packet/tm_queue should next be serviced.

==== tm_nodes

Tm_nodes are the main "entity"/object that a TM system is composed of. Each
tm_node is a mini-TM subsystem of its own, but the interconnection and
interplay of a multi-level "tree" of tm_nodes can allow the user to specify
some very sophisticated behaviors.  Each tm_node can contain a set of
schedulers (one per strict priority level), a strict priority multiplexer, a
bandwidth shaper and a WRED component - or a subset of these.

.Traffic Manager Node
image::tm_node.svg[align="center"]

In its full generality a tm_node consists of a set of "fan-in" connections to
preceding tm_queues or tm_nodes.  The fan-in for a single tm_node can range
from one to many thousands.  This fan-in is divided first into one WFQ
scheduler per priority level, so if 4 priority levels are implemented by this
tm_node, there would be 4 WFQ schedulers - each with its own unique fan-in.
After the WFQ schedulers comes a priority chooser, which always chooses the
highest priority WFQ output available.  The output of the priority chooser
then feeds a bandwidth shaper function, which finally uses the shaper's
propagation table to determine its output packet and its priority.  This
output could then be remapped via a priority map profile and become one of the
fan-in inputs to perhaps another level of tm_nodes, and so on.

During this process it is important to remember that the bandwidth shaping
function never causes packets to be dropped.  Instead all packet drops occur
because of tm_queue fullness, or by running the WRED algorithm at the time a
new packet attempts to be appended to the end of some input queue.

The WRED profile associated with a tm_node considers the entire set of
tm_queues feeding directly or indirectly into it as its measure of queue
fullness.

==== tm_queues

tm_queues are the second major type of "entity"/object that a TM system is
composed of.  All packets MUST first enter the TM system via some tm_queue.
Then, logically, the head packets of all of the tm_queues are examined
simultaneously by the entire TM system, and ONE tm_queue is chosen to send its
head packet out of the TM system's egress.  Abstractly, packets stay in the
tm_queue until they are chosen, at which time they are instantly transferred
from tm_queue to/through the corresponding TM egress.  It is also important to
note that packets in the same tm_queue MUST always stay in order.  In other
words, the second packet in a tm_queue must never leave the TM system through
a TM egress spigot before the first packet has left the system; tm_queue
packet order must always be maintained.
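
For illustration, a packet typically enters the TM system by enqueueing it onto
a tm_queue.  The hedged sketch below assumes the ODP call `odp_tm_enq()` with
the signature the author recalls (tm_queue handle plus packet handle, returning
0 on success), and a previously created tm_queue `tmq` and packet `pkt`;
`handle_enq_failure()` is a hypothetical application hook:

[source,c]
----
/* Sketch only: assumes tmq is a valid odp_tm_queue_t and pkt a valid
 * odp_packet_t; the exact return semantics of odp_tm_enq() should be
 * checked against the API reference. */
if (odp_tm_enq(tmq, pkt) != 0) {
	/* The packet was not accepted (e.g. dropped by WRED or because
	 * the tm_queue was full). */
	handle_enq_failure(pkt);
}
----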

==== TM egress

Note that TM egress objects are NOT referred to as queues, because in many
(or most) cases they don't have a multi-packet structure but instead are
viewed as a port/spigot through which the TM system schedules and finally
transfers input packets.

=== Ideal versus Actual Behavior

It is important to recognize the difference between the "abstract" mathematical
model of the prescribed behavior and real implementations.  The model describes
the Ideal, theoretically desired behavior, but such an Ideal is generally not
practical to implement.  Instead, virtually all real TM systems attempt to
approximate the Ideal behavior as given by the TM configuration as best they
can - while still attaining high packet processing performance.  The idea is
that instead of trying too hard to be
"perfect" at the granularity of say microseconds, it may be better to instead
try to match the long term Ideal behavior over a much more reasonable period of
time like a millisecond.  It is generally better to have a stable
implementation that when averaged over a period of several milliseconds matches
the Ideal behavior very closely than to have an implementation that is perhaps
more accurate over a period of microseconds, but whose millisecond averaged
behavior drifts away from the Ideal case.

=== Other TM Concepts

==== Profiles

This specification often packages related TM system parameters into
records/objects called profiles.  These profiles can then be associated with
various entities such as tm_nodes and tm_queues.  This reduces the amount of
storage needed for related parameters, makes it easy to re-use the same
parameter set many times, and allows a parameter set to be changed once and
have the change affect all of the entities with which it is associated.
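
As a concrete illustration of the profile idea, the sketch below creates a
single bandwidth shaper profile and applies it to two tm_queues.  The shaper
calls and field names (`commit_bps`, `commit_burst`, `dual_rate`) follow the
ODP TM API as the author recalls it and may differ slightly between ODP
versions; `tmq_A1` and `tmq_B1` are assumed to already exist:

[source,c]
----
odp_tm_shaper_params_t shaper_params;
odp_tm_shaper_t        shaper_profile;

odp_tm_shaper_params_init(&shaper_params);
shaper_params.commit_bps   = 50 * 1000 * 1000; /* 50 Mbps commit rate */
shaper_params.commit_burst = 100000;           /* burst size (API-defined units) */
shaper_params.dual_rate    = 0;                /* single rate shaping */
shaper_profile = odp_tm_shaper_create("Shaper50M", &shaper_params);

/* The same parameter set is applied to many entities; changing the
 * profile later affects every entity it is associated with. */
odp_tm_queue_shaper_config(tmq_A1, shaper_profile);
odp_tm_queue_shaper_config(tmq_B1, shaper_profile);
----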

==== Absolute Limits versus `odp_tm_capability_t`

The TM API defines some constants representing the absolute maximum settings
for any TM system, though in most cases a TM system can (and should) be
created/instantiated with smaller values, since smaller values will often
result in faster operation and/or less memory used.
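
A hedged sketch of comparing queried capabilities against what an application
needs; `odp_tm_capabilities()` and the `max_tm_queues`/`max_levels` field names
are the author's recollection of the ODP TM API and should be verified against
the header:

[source,c]
----
odp_tm_capabilities_t cap;

/* Query the capabilities of the first TM system provided by this
 * platform (assumed call form - check traffic_mngr.h). */
if (odp_tm_capabilities(&cap, 1) < 1)
	return -1;

/* The platform may support far fewer tm_queues or levels than the
 * absolute maximums defined by the API constants. */
if (cap.max_tm_queues < 1000 || cap.max_levels < 2)
	return -1;  /* this example topology cannot be built here */
----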

==== Packet Marking

The Packet Marking API is used to mark a packet based upon the final packet
color assigned to it when it reaches the egress node.  This is an optional
feature and, if available on the platform, is used to reflect the packet color
in the IPv4/IPv6 DiffServ field in accordance with
https://www.ietf.org/rfc/rfc2474.txt[RFC 2474].  Three different packet
marking fields are supported:

. Assured Forwarding in accordance with
https://www.ietf.org/rfc/rfc2597.txt[RFC 2597]: the DSCP is marked to set the
packet Drop Precedence according to the color, i.e., High Drop Precedence for
RED, Medium Drop Precedence for YELLOW, and the DSCP is left unchanged if the
color is GREEN.
. Explicit Congestion Notification per
https://www.ietf.org/rfc/rfc3168.txt[RFC 3168]: a router encountering
congestion can signal it by setting the lower 2 bits of the DiffServ field to
the "11" (Congestion Encountered) code, which will ultimately reduce the
transmission rate of the packet sender.
. The IEEE 802.1q VLAN tag header contains a DE (Drop Eligibility) bit for
marking a packet for downstream switches; it is only valid for Ethernet
packets containing a VLAN tag.

RFC 3168 is only valid for TCP packets whereas RFC 2597 is valid for IPv4/IPv6
traffic.

The values are set per color and hence the implementation may support these
parameters only for specific colors.  The marking_colors_supported field in
the capabilities structure can be used to check which colors are supported for
marking.

==== VLAN Marking

VLAN marking is used to enable drop eligibility marking on a packet based on
the packet color.  If drop eligibility is enabled, then the implementation
will set the one bit VLAN Drop Eligibility Indicator (DEI) field (but only for
packets that already carry a VLAN tag) based upon the final packet color
assigned to the packet when it reaches the egress node.  When
drop_eligible_enabled is false, then the given color has no effect on the VLAN
fields.  See IEEE 802.1q for more details.  The `vlan_marking_supported`
boolean in the capability structure indicates whether this feature is
supported by the implementation.
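
A short hedged sketch follows; `odp_tm_vlan_marking()` and its argument order
(tm handle, color, drop_eligible_enabled) are the author's recollection of the
ODP TM API and should be verified, while `tm` and `capabilities` are assumed to
have been obtained earlier:

[source,c]
----
if (capabilities.vlan_marking_supported) {
	/* Mark YELLOW and RED packets as drop eligible; GREEN packets
	 * are left unchanged. */
	odp_tm_vlan_marking(tm, ODP_PACKET_YELLOW, 1);
	odp_tm_vlan_marking(tm, ODP_PACKET_RED, 1);
}
----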

==== Explicit Congestion Notification Marking

The `odp_tm_ecn_marking()` function allows one to configure the TM
egress so that the two bit ECN subfield of the eight bit TOS field of an
IPv4 packet OR the eight bit Traffic Class (TC) field of an IPv6 packet can be
selectively modified based upon the final color assigned to the packet when it
reaches the egress.  Note that the IPv4 header checksum will be updated -
but only if the IPv4 TOS field actually changes as a result of this
setting or the `odp_tm_drop_prec_marking()` setting.  For IPv6, since there is
no header checksum, nothing needs to be done.  If ECN is enabled for a
particular color, then the ECN subfield will be set to _ECN_CE_, _i.e.,_
Congestion Experienced.  The `ecn_marking_supported` boolean in the capability
structure indicates whether this feature is supported by the implementation.
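
For example (hedged - the argument order of `odp_tm_ecn_marking()` is assumed
to be tm handle, color, enable flag, and `tm`/`capabilities` are assumed to
have been obtained earlier):

[source,c]
----
if (capabilities.ecn_marking_supported) {
	/* Request ECN_CE marking for YELLOW and RED packets. */
	odp_tm_ecn_marking(tm, ODP_PACKET_YELLOW, 1);
	odp_tm_ecn_marking(tm, ODP_PACKET_RED, 1);
}
----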

==== Drop Precedence Marking

Drop Precedence marking allows one to configure the TM egress to support
Assured Forwarding in accordance with
https://www.ietf.org/rfc/rfc2597.txt[RFC 2597].  The Drop Precedence bits are
contained within the six bit Differentiated Services Code Point subfield of
the IPv4 TOS field or the IPv6 Traffic Class (TC) field.  Specifically, the
Drop Precedence sub-subfield can be accessed with a DSCP bit mask of 0x06.
When enabled for a given color, these two bits will be set to Medium Drop
Precedence (value 0x4) if the color is ODP_PACKET_YELLOW, or to High Drop
Precedence (value 0x6) if the color is ODP_PACKET_RED.

Note that the IPv4 header checksum will be updated - but only if the
IPv4 TOS field actually changes as a result of this setting or the
`odp_tm_ecn_marking()` setting.  For IPv6, since there is no header checksum,
nothing else needs to be done.
The `drop_prec_marking_supported` boolean in the capability structure
indicates whether this feature is supported by the implementation.
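
For example (hedged - the argument order of `odp_tm_drop_prec_marking()` is
assumed to be tm handle, color, enable flag, and `tm`/`capabilities` are
assumed to have been obtained earlier):

[source,c]
----
if (capabilities.drop_prec_marking_supported) {
	odp_tm_drop_prec_marking(tm, ODP_PACKET_YELLOW, 1); /* Medium DP */
	odp_tm_drop_prec_marking(tm, ODP_PACKET_RED, 1);    /* High DP   */
}
----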

=== Examples

.Create a tm_node chain for two nodes and associate the scheduler
[source,c]
----

   odp_tm_params_init(&tm_params); // <1>
   tm_params.pktio = egress_pktio;
   tm = odp_tm_create("Example TM", &tm_params);

/* create 5 input queues here - two at priority 1 and three at priority 2. */

   odp_tm_queue_params_init(&queue_params);
   queue_params.priority = 1;
   tmq_A1 = odp_tm_queue_create(tm, &queue_params);
   tmq_B1 = odp_tm_queue_create(tm, &queue_params);

   queue_params.priority = 2;
   tmq_A2 = odp_tm_queue_create(tm, &queue_params);
   tmq_B2 = odp_tm_queue_create(tm, &queue_params);
   tmq_C2 = odp_tm_queue_create(tm, &queue_params);

   odp_tm_node_params_init(&node_params); // <2>
   node_params.level = 1;
   tm_node_1 = odp_tm_node_create(tm, "TmNode1", &node_params);

   odp_tm_queue_connect(tmq_A1, tm_node_1); // <3>
   odp_tm_queue_connect(tmq_B1, tm_node_1);
   odp_tm_queue_connect(tmq_A2, tm_node_1);
   odp_tm_queue_connect(tmq_B2, tm_node_1);
   odp_tm_queue_connect(tmq_C2, tm_node_1);

/* It is IMPORTANT to understand that the following code does NOT create any
schedulers!  In fact there is NO call to create a tm scheduler that exists
inside of a tm_node.  Such an entity comes into existence as needed. What this
code does is create a scheduler PROFILE, which is effectively a registered set
of common scheduler parameters.  NOTE that this uses some pseudocode below
instead of real C code so as to be more concise. */

   odp_tm_sched_params_init(&sched_params); // <4>
   sched_params.sched_modes = { ODP_TM_FRAME_BASED_WEIGHTS, ... };
   sched_params.sched_weights = { 8, 8, 8,  ... };
   sched_profile_RR = odp_tm_sched_create("SchedProfileRR", &sched_params);

   sched_params.sched_modes = { ODP_TM_BYTE_BASED_WEIGHTS, ... };
   sched_params.sched_weights = { 8, 8, 8, ... };
   sched_profile_FQ = odp_tm_sched_create("SchedProfileFQ", &sched_params);

   odp_tm_queue_sched_config(tm_node_1, tmq_A1, sched_profile_RR); // <5>
   odp_tm_queue_sched_config(tm_node_1, tmq_B1, sched_profile_RR);
   odp_tm_queue_sched_config(tm_node_1, tmq_A2, sched_profile_FQ);
   odp_tm_queue_sched_config(tm_node_1, tmq_B2, sched_profile_FQ);
   odp_tm_queue_sched_config(tm_node_1, tmq_C2, sched_profile_FQ);

   odp_tm_node_params_init(&node_params); // <6>
   node_params.level = 2;
   tm_node_2 = odp_tm_node_create(tm, "TmNode2", &node_params);

   odp_tm_node_connect(tm_node_1, tm_node_2); // <7>

   odp_tm_sched_params_init(&sched_params); // <8>
   sched_params.sched_modes = { ODP_TM_BYTE_BASED_WEIGHTS, ... };
   sched_params.sched_weights = { 8, 16, 24,  ... };
   sched_profile_WFQ = odp_tm_sched_create("SchedProfileWFQ", &sched_params);

   odp_tm_node_sched_config(tm_node_2, tm_node_1, sched_profile_WFQ); // <9>
----

<1> Create a TM system, since that is a precursor to creating tm_queues.
<2> Create Node #1.
<3> Connect the queues to the node -> odp_tm_queue_connect().
<4> Create two sets of scheduler params - one implementing Round Robin (since
all weights are the same - namely 8) and the second implementing Fair Queuing.
<5> Associate the schedulers with the node and the queues ->
odp_tm_queue_sched_config().  Use the Round Robin profile for the priority 1
fan-ins and Fair Queuing for the priority 2 fan-ins.
<6> Create a second node, Node #2.
<7> Connect the first node (Node #1) to the second node (Node #2) ->
odp_tm_node_connect().
<8> Create a third scheduler profile (a WFQ profile with unequal weights).
<9> Associate the scheduler with Node #1's fan-in into Node #2 ->
odp_tm_node_sched_config().