-*- outline -*-

* L3 support

** OVN_Northbound schema

*** Needs to support extra routes

Currently a router port has a single route associated with it, but
presumably we should support multiple routes.  For connections from
one router to another, this doesn't seem to matter (just put more than
one connection between them), but for connections between a router and
a switch it might matter because a switch has only one router port.
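
For example (purely illustrative), if a host on the switch's subnet
acts as a gateway to some other prefix, the router needs a second
route pointing at that host through its single port facing that
switch.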

** OVN_SB schema

*** Allow output to ingress port

Sometimes when a packet ingresses into a router, it has to egress the
same port.  One example is a "one-armed" router that has multiple
routes on a single port (or in which a host is (mis)configured to send
every IP packet to the router, e.g. due to a bad netmask).  Another is
when a router needs to send an ICMP reply to an ingressing packet.

To some degree this problem is layered, because there are two
different notions of "ingress port".  The first is the OpenFlow
ingress port, essentially a physical port identifier.  This is
implemented as part of ovs-vswitchd's OpenFlow implementation.  It
prevents a reply from being sent across the tunnel on which it
arrived.  It is questionable whether this OpenFlow feature is useful
to OVN.  (OVN already has to override it to allow a packet from one
nested container to be forwarded to a different nested container.)
OVS makes it possible to disable this feature of OpenFlow by setting
the OpenFlow input port field to 0.  (If one does this too early, of
course, it means that there's no way to actually match on the input
port in the OpenFlow flow tables, but one can work around that by
instead setting the input port just before the output action, possibly
wrapping these actions in push/pop pairs to preserve the input port
for later.)
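
Concretely, the complete workaround corresponds to an OpenFlow action
sequence along the lines of "push:NXM_OF_IN_PORT[],
load:0->NXM_OF_IN_PORT[], output:N, pop:NXM_OF_IN_PORT[]" in
ovs-ofctl syntax, where N stands for the desired output port; loading
zero into the input port field is the usual idiom for letting a
packet go back out its ingress port.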

The second is the OVN logical ingress port, which is implemented in
ovn-controller as part of the logical abstraction, using an OVS
register.  Dropping packets directed to the logical ingress port is
implemented through an OpenFlow table not directly visible to the
logical flow table.  Currently this behavior can't be disabled, but
various ways to disable it could be implemented, e.g. the same way as
for OpenFlow, by allowing the logical inport to be zeroed, or by
introducing a new action that ignores the inport.

** New OVN logical actions

*** arp

Generates an ARP packet based on the current IPv4 packet and allows it
to be processed as part of the current pipeline (and then pops back to
processing the original IPv4 packet).

TCP/IP stacks typically limit the rate at which ARPs are sent, e.g. to
one per second for a given target.  We might need to do this too.

We probably need to buffer the packet that generated the ARP.  I don't
know where to do that.
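
As a minimal standalone C sketch of the kind of per-target rate
limiting that might be needed (the names and the one-per-second policy
are illustrative; a real implementation would key a table of these off
the target IP address):

    #include <stdbool.h>
    #include <time.h>

    struct arp_limiter {
        time_t last_sent;   /* When we last sent an ARP for this target. */
    };

    /* Returns true if an ARP for this target may be sent now,
     * allowing at most one ARP per second per target. */
    static bool
    arp_may_send(struct arp_limiter *lim)
    {
        time_t now = time(NULL);
        if (now != lim->last_sent) {
            lim->last_sent = now;
            return true;
        }
        return false;
    }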

*** icmp4 { action... }

Generates an ICMPv4 packet based on the current IPv4 packet and
processes it according to each nested action (and then pops back to
processing the original IPv4 packet).  The intended use case is for
generating "time exceeded" and "destination unreachable" errors.

ovn-sb.xml includes a tentative specification for this action.

Tentatively, the icmp4 action sets a default icmp_type and icmp_code
and lets the nested actions override them.  This means that we'd have
to make icmp_type and icmp_code writable.  Because changing icmp_type
and icmp_code can change the interpretation of the rest of the data in
the ICMP packet, we would want to think this through carefully.  If it
seems like a bad idea then we could instead make the type and code
parameters to the action: icmp4(type, code) { action... }
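
For example, under that design a logical flow's actions might read
"icmp4 { icmp4.type = 11; icmp4.code = 0; next; };" to generate a
"time exceeded" error (purely illustrative; whether icmp4.type and
icmp4.code should be writable fields is exactly the open question
above).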

It is worth considering what the ingress port of the ICMPv4 packet
should be.  It's quite likely that the ICMPv4 packet is going to go
back out the ingress port.  Maybe the icmp4 action, therefore, should
clear the inport, so that output to the original inport won't be
discarded.

*** tcp_reset

Transforms the current TCP packet into a RST reply.

ovn-sb.xml includes a tentative specification for this action.
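
As a standalone C sketch of the header rewriting that this
transformation implies, following RFC 793 (illustrative only: it uses
the Linux netinet headers, whereas the real implementation would
operate on OVS's packet structures, and payload truncation, length,
TTL, and checksum updates are omitted):

    #include <arpa/inet.h>
    #include <netinet/ip.h>
    #include <netinet/tcp.h>

    static void
    tcp_to_rst(struct iphdr *ip, struct tcphdr *tcp, uint32_t data_len)
    {
        uint32_t addr = ip->saddr;      /* Swap IP addresses. */
        ip->saddr = ip->daddr;
        ip->daddr = addr;

        uint16_t port = tcp->source;    /* Swap TCP ports. */
        tcp->source = tcp->dest;
        tcp->dest = port;

        if (tcp->ack) {
            /* RFC 793: the RST takes its sequence number from the
             * incoming segment's ACK field and carries no ACK. */
            tcp->seq = tcp->ack_seq;
            tcp->ack = 0;
            tcp->ack_seq = 0;
        } else {
            /* Otherwise seq = 0 and the RST acknowledges the segment
             * (SYN and FIN each occupy one unit of sequence space). */
            tcp->ack_seq = htonl(ntohl(tcp->seq) + data_len
                                 + tcp->syn + tcp->fin);
            tcp->seq = 0;
            tcp->ack = 1;
        }
        tcp->rst = 1;
        tcp->syn = 0;
        tcp->fin = 0;
    }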

*** Other actions for IPv6.

IPv6 will probably need an action or actions for ND similar to the
"arp" action, and an action for generating ICMPv6 errors similar to
the "icmp4" action.

*** ovn-controller translation to OpenFlow

The following two translation strategies come to mind.  We might want
to implement some of the new actions one way and others the other way,
depending on the details.

*** Implementation strategies

One way to do this is to define new actions as Open vSwitch extensions
to OpenFlow, emit those actions in ovn-controller, and implement them
in ovs-vswitchd (possibly pushing the implementations into the Linux
and DPDK datapaths as well).  This is the only acceptable way for
actions that need high performance.  None of these actions obviously
need high performance, but it might be necessary to have fairness in
handling e.g. a flood of incoming packets that require these actions.
The main disadvantage of this approach is that it ties ovs-vswitchd
(and the Linux kernel module) to supporting these actions essentially
forever, which means that we'd want to make sure that they are
general-purpose, well designed, maintainable, and supportable.

The other way to do this is to send the packets across an OpenFlow
channel to ovn-controller and have ovn-controller process them.  This
is acceptable for actions that don't need high performance, and it
means that we don't add anything permanently to ovs-vswitchd or the
kernel (so we can be more casual about the design).  The big
disadvantage is that it becomes necessary to add a way to resume the
OpenFlow pipeline when it is interrupted in the middle by sending a
packet to the controller.  This is not as simple as doing a new flow
table lookup and resuming from that point.  Instead, it is equivalent
to the (very complicated) recirculation logic in ofproto-dpif-xlate.c.
Much of this logic can be translated into OpenFlow actions (e.g. the
call stack and data stack), but some of it is entirely outside
OpenFlow (e.g. the state of mirrors).

To implement it properly, it
seems that we'll have to introduce a new Open vSwitch extension to
OpenFlow, a "send-to-controller" action that causes extra data to be
sent to the controller, where the extra data packages up the state
necessary to resume the pipeline.  Maybe the bits of the state that
can be represented in OpenFlow can be embedded in this extra data in a
controller-readable form, but other bits we might want to be opaque.
It's also likely that we'll want to change and extend the form of this
opaque data over time, so this should be allowed for, e.g. by
including a nonce in the extra data that is newly generated every time
ovs-vswitchd starts.
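
As a very rough illustration of what that extra data might carry (this
structure is hypothetical, not an existing OVS definition):

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical continuation state attached by a
     * "send-to-controller" extension action.  The controller treats
     * the opaque part as a black box and hands it back verbatim to
     * resume the pipeline. */
    struct resume_context {
        /* State representable in OpenFlow, controller-readable. */
        uint8_t table_id;       /* Table in which to resume processing. */
        uint64_t metadata;      /* e.g. the logical datapath. */

        /* State outside OpenFlow (mirrors, stacks, ...), opaque. */
        uint32_t nonce;         /* Regenerated at each ovs-vswitchd
                                 * start, invalidating stale
                                 * continuations. */
        size_t opaque_len;
        uint8_t opaque[];       /* The packaged pipeline state. */
    };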

*** OpenFlow action definitions

Define OpenFlow wire structures for each new OpenFlow action and
implement them in lib/ofp-actions.[ch].
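
For instance, a new extension action would follow the existing Nicira
action pattern, roughly like this sketch (the structure name and
subtype are hypothetical; ovs_be16, ovs_be32, and OFP_ASSERT come from
the OVS headers):

    /* Hypothetical wire structure for a "tcp_reset" extension action. */
    struct nx_action_tcp_reset {
        ovs_be16 type;          /* OFPAT_VENDOR. */
        ovs_be16 len;           /* 16. */
        ovs_be32 vendor;        /* NX_VENDOR_ID. */
        ovs_be16 subtype;       /* NXAST_TCP_RESET (hypothetical). */
        uint8_t pad[6];         /* Pad to a multiple of 8 bytes. */
    };
    OFP_ASSERT(sizeof(struct nx_action_tcp_reset) == 16);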

*** OVS implementation

Add code for action translation.  Possibly add datapath code for
action implementation.  However, none of these new actions should
require high-bandwidth processing so we could at least start with them
implemented in userspace only.  (ARP field modification is already
userspace-only and no one has complained yet.)

** IPv6

*** ND versus ARP

*** IPv6 routing

*** ICMPv6

** Dynamic IP to MAC bindings

Some bindings from IP address to MAC will undoubtedly need to be
discovered dynamically through ARP requests.  It's straightforward
enough for a logical L3 router to generate ARP requests and forward
them to the appropriate switch.

It's more difficult to figure out where the reply should be processed
and stored.  It might seem at first that a first-cut implementation
could just keep track of the binding on the hypervisor that needs to
know, but that can't easily work: the VM that sends the reply might
not be on the same HV as the VM that needs the answer (that is, the VM
that sent the packet that triggered resolution of the binding), and
there isn't an easy way for the replying HV to know which HV needs the
answer.

Thus, the HV that processes the ARP reply (which is unknown when the
ARP is sent) has to tell all the HVs the binding.  The most obvious
place for this is the OVN_Southbound database.

Details need to be worked out, including:

*** OVN_Southbound schema changes.

Possibly bindings could be added to the Port_Binding table by adding
or modifying columns.  Another possibility is to add a new table.

*** Logical_Flow representation

It would be really nice to maintain the general-purpose nature of
logical flows, but these bindings might have to include some
hard-coded special cases, especially when it comes to the relationship
with populating the bindings into the OVN_Southbound table.

*** Tracking queries

It's probably best to only record in the database responses to queries
actually issued by an L3 logical router, so somehow they have to be
tracked, probably by putting a tentative binding without a MAC address
into the database.

*** Renewal and expiration.

Something needs to make sure that bindings remain valid and expire
those that become stale.

** MTU handling (fragmentation on output)

** Ratelimiting.

*** ARP.

*** ICMP error generation, TCP reset, UDP unreachable, protocol unreachable, ...

As a point of comparison, Linux doesn't ratelimit TCP resets, but I
think it ratelimits everything else.

* ovn-controller

** ovn-controller parameters and configuration.

*** SSL configuration.

    Can probably get this from Open_vSwitch database.

** Security

*** Limiting the impact of a compromised chassis.

    Every instance of ovn-controller has the same full access to the central
    OVN_Southbound database.  This means that a compromised chassis can
    interfere with the normal operation of the rest of the deployment.  Some
    specific examples include writing to the logical flow table to alter
    traffic handling or updating the port binding table to claim ports that are
    actually present on a different chassis.  In practice, the compromised host
    would be fighting against ovn-northd and other instances of ovn-controller
    that would be trying to restore the correct state.  The impact could include
    at least temporarily redirecting traffic (so the compromised host could
    receive traffic that it shouldn't) and potentially a more general denial of
    service.

    There are different potential improvements to this area.  The first would be
    to add some sort of ACL scheme to ovsdb-server.  A proposal for this should
    first include an ACL scheme for ovn-controller.  An example policy would
    be to make Logical_Flow read-only.  Table-level control is needed, but is
    not enough.  For example, ovn-controller must be able to update the Chassis
    and Encap tables, but should only be able to modify the rows associated with
    that chassis and no others.

    A more complex example is the Port_Binding table.  Currently, ovn-controller
    is the source of truth for where a port is located.  There seems to be no
    policy that could prevent a compromised host from misusing this table.

    An alternative scheme for port bindings would be to provide an optional mode
    where an external entity controls port bindings and make them read-only to
    ovn-controller.  This is actually how OpenStack works today, for example.
    The part of OpenStack that manages VMs (Nova) tells the networking component
    (Neutron) where a port will be located, as opposed to the networking
    component discovering it.

* ovsdb-server

  ovsdb-server should have adequate features for OVN but it probably
  needs work for scale and possibly for availability as deployments
  grow.  Here are some thoughts.

  Andy Zhou is looking at these issues.

** Reducing amount of data sent to clients.

    Currently, whenever a row monitored by a client changes,
    ovsdb-server sends the client every monitored column in the row,
    even if only one column changes.  It might be valuable to reduce
    this to only the columns that actually changed.

    Also, whenever a column changes, ovsdb-server sends the entire
    contents of the column.  It might be valuable, for columns that
    are sets or maps, to send only added or removed values or
    key-value pairs.

    Currently, clients monitor the entire contents of a table.  It
    might make sense to allow clients to monitor only rows that
    satisfy specific criteria, e.g. to allow an ovn-controller to
    receive only Logical_Flow rows for logical networks on its hypervisor.
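
    (A client can already choose which tables and columns it monitors;
    with the C IDL that looks roughly like the sketch below, where the
    sbrec_* names stand for the IDL-generated OVN_Southbound bindings
    and lib/ovsdb-idl.h is assumed to be included.  Row-level filtering
    as suggested above would additionally need protocol support.)

        /* Monitor only the Logical_Flow table, and only two of its
         * columns. */
        struct ovsdb_idl *
        create_sb_monitor(const char *remote)
        {
            struct ovsdb_idl *idl = ovsdb_idl_create(
                remote, &sbrec_idl_class,
                false /* monitor_everything_by_default */,
                true /* retry */);
            ovsdb_idl_add_table(idl, &sbrec_table_logical_flow);
            ovsdb_idl_add_column(idl, &sbrec_logical_flow_col_match);
            ovsdb_idl_add_column(idl, &sbrec_logical_flow_col_actions);
            return idl;
        }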

** Reducing redundant data and code within ovsdb-server.

    Currently, ovsdb-server separately composes database update
    information to send to each of its clients.  This is fine for a
    small number of clients, but it wastes time and memory when
    hundreds of clients all want the same updates (as will be the
    case in OVN).

    (This is somewhat opposed to the idea of letting a client monitor
    only some rows in a table, since that would increase the diversity
    among clients.)

** Multithreading.

    If it turns out that other changes don't let ovsdb-server scale
    adequately, we can multithread ovsdb-server.  Initially one might
    only break protocol handling into separate threads, leaving the
    actual database work serialized through a lock.
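
    A minimal sketch of that split (purely illustrative;
    read_request(), execute_transaction(), and send_reply() are
    hypothetical stand-ins for ovsdb-server's protocol and transaction
    code):

        #include <pthread.h>

        struct request;
        struct request *read_request(void *conn);
        void execute_transaction(struct request *);
        void send_reply(void *conn, struct request *);

        static pthread_mutex_t db_mutex = PTHREAD_MUTEX_INITIALIZER;

        static void *
        connection_thread(void *conn)
        {
            for (;;) {
                /* Protocol work (parsing JSON-RPC) runs concurrently
                 * in each connection's own thread. */
                struct request *rq = read_request(conn);
                if (!rq) {
                    return NULL;
                }

                /* The actual database work stays serialized. */
                pthread_mutex_lock(&db_mutex);
                execute_transaction(rq);
                pthread_mutex_unlock(&db_mutex);

                send_reply(conn, rq);
            }
        }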

** Increasing availability.

   Database availability might become an issue.  The OVN system
   shouldn't grind to a halt if the database becomes unavailable, but
   it would become impossible to bring VIFs up or down, etc.

   My current thought on how to increase availability is to add
   clustering to ovsdb-server, probably via the Raft consensus
   algorithm.  As an experiment, I wrote an implementation of Raft
   for Open vSwitch that you can clone from:

       https://github.com/blp/ovs-reviews.git raft

** Reducing startup time.

   As-is, if ovsdb-server restarts, every client will fetch a fresh
   copy of the part of the database that it cares about.  With
   hundreds of clients, this could cause heavy CPU load on
   ovsdb-server and use excessive network bandwidth.  It would be
   better to allow incremental updates even across connection loss.
   One way might be to use "Difference Digests" as described in
   Epstein et al., "What's the Difference? Efficient Set
   Reconciliation Without Prior Context".  (I'm not yet aware of
   previous non-academic use of this technique.)

** Support multiple tunnel encapsulations in Chassis.

   So far, both ovn-controller and ovn-controller-vtep only allow
   chassis to have one tunnel encapsulation entry.  We should extend
   the implementation to support multiple tunnel encapsulations.

** Update learned MAC addresses from VTEP to OVN

   The VTEP gateway stores all MAC addresses learned from its
   physical interfaces in the 'Ucast_Macs_Local' and the
   'Mcast_Macs_Local' tables.  ovn-controller-vtep should be
   able to propagate that information back to the ovn-sb database,
   so that other chassis know where to send packets destined
   for the extended external network instead of broadcasting them.

** Translate ovn-sb Multicast_Group table into VTEP config

   The ovn-controller-vtep daemon should be able to translate
   Multicast_Group table entries in the ovn-sb database into
   Mcast_Macs_Remote table configuration in the VTEP database.

* Use BFD as tunnel monitor.

   Both ovn-controller and ovn-controller-vtep should use BFD to
   monitor tunnel liveness.  Both the ovs-vswitchd schema and the
   VTEP schema support BFD.

* ACL

** Support FTP ALGs.

** Support reject action.

** Support log option.