Note: This reference manual is a draft. The API defined in this document is not guaranteed to be stable or complete and future versions of Snabb will introduce backwards incompatible changes. With that being said, discrepancies between this document and the actual Snabb Switch implementation are considered to be bugs. Please report them in order to help improve this document.
Snabb is an extensible, virtualized, Ethernet networking toolkit. With Snabb you can implement networking applications using the Lua language. Snabb includes all the tools you need to quickly realize your network designs, and it's really fast too! Furthermore, Snabb is extensible and encourages you to grow the ecosystem to match your requirements.
The Snabb Core forms a runtime environment (engine) which executes your design. A design is simply a Lua script used to drive the Snabb stack; you can think of it as your top-level “main” routine.
In order to add functionality to the Snabb stack you can load modules into the Snabb engine. These can be Lua modules as well as native code objects. We differentiate between two classes of modules, namely libraries and Apps. Libraries are simple collections of program utilities to be used in your designs, apps or other libraries, just as you might expect. Apps, on the other hand, are code objects that implement a specific interface, which is used by the Snabb engine to organize an App Network.
Usually, a Snabb design will create a series of apps, interconnect these in a desired way using links and finally pass the resulting app network on to the Snabb engine. The engine’s job is to:
The core modules defined below can be loaded using Lua’s require. For example:
local config = require("core.config")
local c = config.new()
...
An app is an isolated implementation of a specific networking function. For example, a switch, a router, or a packet filter.
Apps receive packets on input ports, perform some processing, and transmit packets on output ports. Each app has zero or more input and output ports. For example, a packet filter may have one input and one output port, while a packet recorder may have only an input port. Every app must implement the interface below. Methods which may be left unimplemented are marked as “optional”.
— Method myapp:new arg
Required. Create an instance of the app with a given argument arg. Myapp:new must return an instance of the app. The handling of arg is up to the app but it is encouraged to use core.config’s parse_app_arg to parse arg.
— Field myapp.input
— Field myapp.output
Tables of named input and output links. These tables are initialized by the engine for use in processing and are read-only.
— Field myapp.appname
Name of the app. Read-only.
— Field myapp.shm
Can be set to a specification for core.shm.create_frame. When set, this field will be initialized to a frame of shared memory objects by the engine.
— Field myapp.config
Can be set to a specification for core.lib.parse. When set, the specification will be used to validate the app’s arg when it is configured using config.app.
— Method myapp:link
Optional. Called any time the app’s links may have been changed (including on start-up). Guaranteed to be called before pull and push are called with new links.
— Method myapp:pull
Optional. Pull packets into the network.
For example: Pull packets from a network adapter into the app network by transmitting them to output ports.
— Method myapp:push
Optional. Push packets through the system.
For example: Move packets from input ports to output ports or to a network adapter.
— Method myapp:reconfig arg
Optional. Reconfigure the app with a new arg. If this method is not implemented the app instance is discarded and a new instance is created.
— Method myapp:report
Optional. Print a report of the current app status.
— Method myapp:stop
Optional. Stop the app and release associated external resources.
— Field myapp.zone
Optional. Name of the LuaJIT profiling zone used for this app (descriptive string). The default is the module name.
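The interface above can be sketched with a small forwarding app. This is only an illustration: the Forward class, its label argument and the inline link stand-in are assumptions made so the sketch runs outside Snabb. Inside a Snabb process the engine would create the instance and fill in the input and output tables, and the real core.link module would be used (which, unlike the stand-in, has a bounded ring and drops packets when full).

```lua
-- Stand-in for core.link so the sketch runs outside Snabb. Only the
-- functions used below are modeled; link.new is a hypothetical constructor.
local link = {}
function link.new () return { packets = {} } end
function link.empty (l) return #l.packets == 0 end
function link.receive (l) return table.remove(l.packets, 1) end
function link.transmit (l, p) table.insert(l.packets, p) end

-- Hypothetical app: moves packets from port "input" to port "output"
-- and counts them.
local Forward = {}
Forward.__index = Forward

function Forward:new (arg)
   -- Handling of arg is up to the app; here it is an optional table.
   local label = (arg and arg.label) or "fwd"
   return setmetatable({ forwarded = 0, label = label }, Forward)
end

function Forward:push ()
   local i, o = self.input.input, self.output.output
   while not link.empty(i) do
      link.transmit(o, link.receive(i))
      self.forwarded = self.forwarded + 1
   end
end

function Forward:report ()
   print(("%s: forwarded %d packets"):format(self.label, self.forwarded))
end

-- Simulate what the engine would normally do: instantiate the app,
-- attach links, and call push.
local app = Forward:new({ label = "demo" })
app.input  = { input  = link.new() }
app.output = { output = link.new() }
link.transmit(app.input.input, "p1")
link.transmit(app.input.input, "p2")
app:push()
app:report()
```

Only new is required; pull, push, report and the other methods are implemented as needed by the app.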
A config is a description of a packet-processing network. The network is a directed graph. Nodes in the graph are apps that each process packets in a specific way. Each app has a set of named input and output ports—often called rx and tx. Edges of the graph are unidirectional links that carry packets from an output port to an input port.
The config is a purely passive data structure. Creating and manipulating a config object does not immediately affect operation. The config has to be activated using engine.configure.
— Function config.new
Creates and returns a new empty configuration.
— Function config.app config, name, class, arg
Adds an app of class with arg to the config where it will be assigned to name.
Example:
config.app(c, "nic", Intel82599, {pciaddr = "0000:00:00.0"})
— Function config.link config, linkspec
Add a link defined by linkspec to the config config. Linkspec must be a string of the format
app_name1.output_port->app_name2.input_port
where app_name1 and app_name2 are names of apps in config and output_port and input_port are valid output and input ports of the referenced apps respectively.
Example:
config.link(c, "nic1.tx->nic2.rx")
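To make the linkspec format concrete, the following sketch splits such a string into its four parts with a Lua pattern. The helper name parse_linkspec and the set of characters accepted in app and port names are assumptions for illustration; this is not Snabb's own parser.

```lua
-- Hypothetical helper: split "app1.out->app2.in" into its components.
-- Assumes names consist of letters, digits and underscores only.
local function parse_linkspec (spec)
   local from_app, from_port, to_app, to_port =
      spec:match("^%s*([%w_]+)%.([%w_]+)%s*%->%s*([%w_]+)%.([%w_]+)%s*$")
   if not from_app then
      return nil, "malformed linkspec: " .. spec
   end
   return { from_app = from_app, from_port = from_port,
            to_app = to_app, to_port = to_port }
end

local l = assert(parse_linkspec("nic1.tx->nic2.rx"))
print(l.from_app, l.from_port, l.to_app, l.to_port)
```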
The engine executes a config by initializing apps, creating links, and driving the flow of execution. The engine also performs profiling and reporting functions. It can be reconfigured during runtime. Within Snabb Switch scripts the core.app module is bound to the global engine variable.
— Function engine.configure config
Configure the engine to use a new config config. You can safely call this method many times to incrementally update the running app network. The engine updates the app network as follows:
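Apps that are no longer part of the configuration are stopped. (The app’s stop() method is called if defined.)
Apps whose configuration has changed are updated by calling their reconfig() method. If the reconfig() method is not implemented then the old instance is stopped and a new one started.
— Function engine.main options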
Run the Snabb engine. Options is a table of key/value pairs. The following keys are recognized:
duration - Duration in seconds to run the engine for (as a floating point number). If this is set you cannot supply done.
done - A function to be called repeatedly by engine.main until it returns true. Once it returns true the engine will be stopped and engine.main will return. If this is set you cannot supply duration.
report - A table which configures the report printed before engine.main() returns. The keys showlinks and showapps can be set to boolean values to force or suppress link and app reporting individually. By default engine.main() will report on links but not on apps.
measure_latency - By default, the breathe() loop is instrumented to record the latency distribution of running the app graph. This information can be processed by the snabb top program. Passing measure_latency=false in the options will disable this instrumentation.
no_report - A boolean value. If true no final report will be printed.
— Function engine.stop
Stop all apps in the engine by loading an empty configuration.
— Function engine.now
Returns monotonic time in seconds as a floating point number. Suitable for timers.
— Variable engine.busywait
If set to true then the engine polls continuously for new packets to process. This consumes 100% CPU and makes processing latency less vulnerable to kernel scheduling behavior which can cause pauses of more than one millisecond.
Default: false
— Variable engine.Hz
Frequency at which to poll for new input packets. The default value is ‘false’ which means to adjust dynamically up to 100us during low traffic. The value can be overridden with a constant integer saying how many times per second to poll.
This setting is not used when engine.busywait is true.
A link is a ring buffer used to store packets between apps. Links can be treated either like arrays—accessing their internal structure directly—or as streams of packets by using their API functions.
— Function link.empty link
Predicate used to test if a link is empty. Returns true if link is empty and false otherwise.
— Function link.full link
Predicate used to test if a link is full. Returns true if link is full and false otherwise.
— Function link.nreadable link
Returns the number of packets on link.
— Function link.nwriteable link
Returns the remaining number of packets that fit onto link.
— Function link.receive link
Returns the next available packet (and advances the read cursor) on link. If the link is empty an error is signaled.
— Function link.front link
Return the next available packet without advancing the read cursor on link. If the link is empty, nil is returned.
— Function link.transmit link, packet
Transmits packet onto link. If the link is full packet is dropped (and the drop counter increased).
— Function link.stats link
Returns a structure holding ring statistics for the link:
txbytes, rxbytes: Counts of transferred bytes.
txpackets, rxpackets: Counts of transferred packets.
txdrop: Count of packets dropped due to ring overflow.
A packet is an FFI object of type struct packet representing a network packet that is currently being processed. The packet object is used to explicitly manage the life cycle of the packet. Packets are explicitly allocated and freed by using packet.allocate and packet.free. When a packet is received using link.receive its ownership is acquired by the calling app. The app must then either transfer the packet ownership to another app by calling link.transmit on the packet or free the packet using packet.free. Apps may only use packets they own, i.e. packets that have not been transmitted or freed. The number of allocatable packets is limited by the size of the underlying “freelist”, i.e. a pool of unused packet objects from and to which packets are allocated and freed.
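The freelist behaviour described above can be pictured with a toy model. The function names mirror packet.allocate and packet.free, but the table-based packets and the pool size of 2 are assumptions made purely for illustration; real packets are FFI objects.

```lua
-- Toy freelist holding two reusable packet objects.
local freelist = {}
for i = 1, 2 do freelist[i] = { length = 0, data = {} } end

local packet = {}

function packet.allocate ()
   local p = table.remove(freelist)   -- nil when the pool is empty
   if not p then error("packet freelist is empty") end
   p.length = 0
   return p
end

function packet.free (p)
   table.insert(freelist, p)          -- the caller gives up ownership here
end

local a = packet.allocate()
local b = packet.allocate()
print(pcall(packet.allocate))   -- false plus an error message: pool exhausted
packet.free(a)                  -- a must not be used after this point
local c = packet.allocate()     -- succeeds again and reuses the freed object
print(c == a)                   -- true
```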
— Type struct packet
struct packet {
    uint16_t length;
    uint8_t  data[packet.max_payload];
};
— Constant packet.max_payload
The maximum payload length of a packet.
— Function packet.allocate
Returns a new empty packet. An error is raised if there are no packets left on the freelist. Initially the length of the allocated packet is 0, and its data is uninitialized garbage.
— Function packet.free packet
Frees packet and puts it back onto the freelist.
— Function packet.clone packet
Returns an exact copy of packet.
— Function packet.resize packet, length
Sets the payload length of packet, truncating or extending its payload. In the latter case the contents of the extended area at the end of the payload are filled with zeros.
— Function packet.append packet, pointer, length
Appends length bytes starting at pointer to the end of packet. An error is raised if there is not enough space in packet to accommodate length additional bytes.
— Function packet.prepend packet, pointer, length
Prepends length bytes starting at pointer to the front of packet, taking ownership of the packet and returning a new packet. An error is raised if there is not enough space in packet to accommodate length additional bytes.
— Function packet.shiftleft packet, length
Take ownership of packet, truncate it by length bytes from the front, and return a new packet. Length must be less than or equal to length of packet.
— Function packet.shiftright packet, length
Take ownership of packet, move the packet payload to the right by length bytes, growing packet by length, and return a new packet. The sum of length and length of packet must be less than or equal to packet.max_payload.
— Function packet.from_pointer pointer, length
Allocate packet and fill it with length bytes from pointer.
— Function packet.from_string string
Allocate packet and fill it with the contents of string.
— Function packet.clone_to_memory pointer, packet
Creates an exact copy of packet at the memory pointed to by pointer. Pointer must point to a packet.packet_t.
Snabb allocates special DMA memory that can be accessed directly by network cards. The important characteristic of DMA memory is being located in contiguous physical memory at a stable address.
— Function memory.dma_alloc bytes, [alignment]
Returns a pointer to bytes of new DMA memory.
Optionally a specific alignment requirement can be provided (in bytes). The default alignment is 128.
— Function memory.virtual_to_physical pointer
Returns the physical address (uint64_t) of the DMA memory at pointer.
— Variable memory.huge_page_size
Size of a single huge page in bytes. Read-only.
This module facilitates creation and management of named shared memory objects. Objects can be created using shm.create similar to ffi.new, except that separate calls to shm.open for the same name will each return a new mapping of the same shared memory. Different processes can share memory by mapping an object with the same name (and type). Each process can map any object any number of times.
Mappings are deleted on process termination or with an explicit shm.unmap. Names are unlinked from objects that are no longer needed using shm.unlink. Object memory is freed when the name is unlinked and all mappings have been deleted.
Names can be fully qualified or abbreviated to be within the current process. Here are examples of names and how they are resolved where <pid> is the PID of this process:
foo/bar ⇒ /var/run/snabb/<pid>/foo/bar
/1234/foo/bar ⇒ /var/run/snabb/1234/foo/bar
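The two resolution rules above can be written down as a short function. This is a sketch only: the helper name resolve is hypothetical, and the PID and root directory are passed in explicitly so that the example does not depend on any process API.

```lua
-- Hypothetical helper modelling the name resolution rules shown above.
-- root defaults to the directory used in the examples.
local function resolve (name, pid, root)
   root = root or "/var/run/snabb"
   if name:sub(1, 1) == "/" then
      -- Fully qualified: the name already starts with a PID component.
      return root .. name
   else
      -- Abbreviated: resolved relative to the current process.
      return root .. "/" .. tostring(pid) .. "/" .. name
   end
end

print(resolve("foo/bar", 4321))         -- /var/run/snabb/4321/foo/bar
print(resolve("/1234/foo/bar", 4321))   -- /var/run/snabb/1234/foo/bar
```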
Behind the scenes the objects are backed by files on ram disk (/var/run/snabb/<pid>) and accessed with the equivalent of POSIX shared memory (shm_overview(7)). The files are automatically removed on shutdown unless the environment variable SNABB_SHM_KEEP is set. The location /var/run/snabb can be overridden by the environment variable SNABB_SHM_ROOT.
Shared memory objects are created world-readable for convenient access by diagnostic tools. You can lock this down by setting SNABB_SHM_ROOT to a path under a directory with appropriate permissions.
The practical limit on the number of objects that can be mapped will depend on the operating system limit for memory mappings. On Linux the default limit is 65,530 mappings:
$ sysctl vm.max_map_count
vm.max_map_count = 65530
— Function shm.create name, type
Creates and maps a shared object of type into memory via a hierarchical name. Returns a pointer to the mapped object.
— Function shm.open name, type, [readonly]
Maps an existing shared object of type into memory via a hierarchical name. If readonly is non-nil the shared object is mapped in read-only mode. Readonly defaults to nil. Fails if the shared object does not already exist. Returns a pointer to the mapped object.
— Function shm.alias new-path, existing-path
Create an alias (symbolic link) for an object.
— Function shm.path name
Returns the fully-qualified path for an object called name.
— Function shm.exists name
Returns a true value if shared object by name exists.
— Function shm.unmap pointer
Deletes the memory mapping for pointer.
— Function shm.unlink path
Unlinks the subtree of objects designated by path from the filesystem.
— Function shm.children path
Returns an array of objects in the directory designated by path.
— Function shm.register type, module
Registers an abstract shared memory object type implemented by module in shm.types. Module must provide the following functions:
and can optionally provide the function:
The module’s type variable must be bound to type. To register a new type a module might invoke shm.register like so:
type = shm.register('mytype', getfenv())
-- Now the following holds true:
-- shm.types[type] == getfenv()
— Variable shm.types
A table that maps types to modules. See shm.register.
— Function shm.create_frame path, specification
Creates and returns a shared memory frame by specification under path. A frame is a table of mapped (possibly abstract) shared memory objects. Specification must be of the form:
{ <name> = {<module>, ...},
... }
Module must implement an abstract type registered with shm.register, and is followed by additional initialization arguments to its create function. Example usage:
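-- Assumes a Snabb process, where get_unix_time is reachable through ffi.C.
local ffi = require("ffi")
local C = ffi.C
local shm = require("core.shm")
local counter = require("core.counter")
-- Create counters foo/bar/{dtime,rxpackets,txpackets}.counter
local f = shm.create_frame(
   "foo/bar",
   {dtime     = {counter, C.get_unix_time()},
    rxpackets = {counter},
    txpackets = {counter}})
counter.add(f.rxpackets)
counter.read(f.dtime)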
— Function shm.open_frame path
Opens and returns the shared memory frame under path for reading.
— Function shm.delete_frame frame
Deletes/unmaps a shared memory frame. The frame directory is unlinked if frame was created by shm.create_frame.
Double-buffered shared memory counters. Counters are 64-bit unsigned values. Registered with core.shm as type counter.
— Function counter.create name, [initval]
Creates and returns a counter by name, initialized to initval. Initval defaults to 0.
— Function counter.open name
Opens and returns the counter by name for reading.
— Function counter.delete name
Deletes and unmaps the counter by name.
— Function counter.commit
Commits buffered counter values to public shared memory.
— Function counter.set counter, value
Sets counter to value.
— Function counter.add counter, [value]
Increments counter by value. Value defaults to 1.
— Function counter.read counter
Returns the value of counter.
Shared memory histogram with logarithmic buckets. Registered with core.shm as type histogram.
— Function histogram.new min, max
Returns a new histogram, with buckets covering the range from min to max. The range between min and max will be divided logarithmically.
— Function histogram.create name, min, max
Creates and returns a histogram as in histogram.new by name. If the file exists already, it will be cleared.
— Function histogram.open name
Opens and returns histogram by name for reading.
— Method histogram:add measurement
Adds measurement to histogram.
— Method histogram:iterate prev
When used as for count, lo, hi in histogram:iterate(), visits all buckets in histogram in order from lowest to highest. Count is the number of samples recorded in that bucket, and lo and hi are the lower and upper bounds of the bucket. Note that count is an unsigned 64-bit integer; to get it as a Lua number, use tonumber.
If prev is given, it should be a snapshot of the previous version of the histogram. In that case, the count values will be returned as a difference between their values in histogram and their values in prev.
— Method histogram:snapshot [dest]
Copies out the contents of histogram into the histogram dest and returns dest. If dest is not given, the result will be a fresh histogram.
— Method histogram:clear
Clears the buckets of histogram.
— Method histogram:wrap_thunk thunk, now
Returns a closure that wraps thunk, measuring and recording the difference between calls to now before and after thunk into histogram.
— Method histogram:summarize prev
Returns the approximate minimum, average, and maximum values recorded in histogram.
If prev is given, it should be a snapshot of a previous version of the histogram. In that case, this method returns the approximate minimum, average and maximum values for the difference between histogram and prev.
The core.lib module contains miscellaneous utilities.
— Function lib.equal x, y
Predicate to test if x and y are structurally similar (isomorphic).
— Function lib.can_open filename, mode
Predicate to test if file at filename can be successfully opened with mode.
— Function lib.can_read filename
Predicate to test if file at filename can be successfully opened for reading.
— Function lib.can_write filename
Predicate to test if file at filename can be successfully opened for writing.
— Function lib.readcmd command, what
Runs Unix shell command and returns what of its output. What must be a valid argument to file:read.
— Function lib.readfile filename, what
Reads and returns what from file at filename. What must be a valid argument to file:read.
— Function lib.writefile filename, value
Writes value to file at filename using file:write. Returns the value returned by file:write.
— Function lib.readlink filename
Returns the true name of symbolic link at filename.
— Function lib.dirname filename
Returns the dirname(3) of filename.
— Function lib.basename filename
Returns the basename(3) of filename.
— Function lib.firstfile directory
Returns the filename of the first file in directory.
— Function lib.firstline filename
Returns the first line of file at filename as a string.
— Function lib.load_string string
Evaluates and returns the value of the Lua expression in string.
— Function lib.load_conf filename
Evaluates and returns the value of the Lua expression in file at filename.
— Function lib.store_conf filename, value
Writes value to file at filename as a Lua expression. Supports tables, strings and everything that can be readably printed using print.
— Function lib.bits bitset, basevalue
Returns a bitmask using the values of bitset as indexes. The keys of bitset are ignored (and can be used as comments).
Example:
bits({RESET=0,ENABLE=4}, 123) => 1<<0 | 1<<4 | 123
— Function lib.bitset value, n
Predicate to test if bit number n of value is set.
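The behaviour of lib.bits and lib.bitset can be modelled with plain arithmetic. The functions below are stand-ins written for illustration, not Snabb's implementations, and the defaulting of basevalue to 0 is an assumption. Note that for the documented example the base value 123 already has bits 0 and 4 set, so the result equals 123.

```lua
-- Plain-arithmetic model so that the sketch runs on any Lua version.
local function pow2 (n)
   local p = 1
   for _ = 1, n do p = p * 2 end
   return p
end

-- True if bit number n (counting from 0) of value is set.
local function bitset (value, n)
   return math.floor(value / pow2(n)) % 2 == 1
end

-- OR the bits named by the values of bitset_tbl into basevalue.
-- The keys of bitset_tbl are ignored.
local function bits (bitset_tbl, basevalue)
   local result = basevalue or 0
   for _, n in pairs(bitset_tbl) do
      if not bitset(result, n) then result = result + pow2(n) end
   end
   return result
end

print(bits({RESET=0, ENABLE=4}))        -- 17
print(bits({RESET=0, ENABLE=4}, 123))   -- 123
print(bitset(17, 4))                    -- true
print(bitset(17, 1))                    -- false
```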
— Function lib.bitfield size, struct, member, offset, nbits, value
Combined accessor and setter function for bit ranges of integers in cdata structs. Sets nbits (number of bits) starting from offset to value. If value is not given the current value is returned.
Size may be one of 8, 16 or 32 depending on the bit size of the integer being set or read.
Struct must be a pointer to a cdata object and member must be the literal name of a member of struct.
Example:
local struct_t = ffi.typeof[[struct { uint16_t flags; }]]
-- Assuming `s' is an instance of `struct_t', set bits 4-7 to 0xF:
lib.bitfield(16, s, 'flags', 4, 4, 0xf)
-- Get the value:
lib.bitfield(16, s, 'flags', 4, 4) -- => 0xF
— Function string:split pattern
Returns an iterator over the string split by pattern. Pattern must be a valid argument to string:gmatch.
Example:
for word, sep in ("foo!bar!baz"):split("(!)") do
   print(word, sep)
end
> foo !
> bar !
> baz nil
— Function lib.hexdump string
Returns hexadecimal string for bytes in string.
— Function lib.hexundump hexstring, n, error
Returns string of n bytes for hexstring. Throws an error if fewer than n hex-encoded bytes could be parsed unless error is false.
Error is optional and can be the error message to throw.
— Function lib.comma_value n
Returns a string for decimal number n with magnitudes separated by commas. Example:
comma_value(1000000) => "1,000,000"
— Function lib.random_bytes_from_dev_urandom length
Return length bytes of random data, as a byte array, taken from /dev/urandom. Suitable for cryptographic usage.
— Function lib.random_bytes_from_math_random length
Return length bytes of random data, as a byte array, where each byte was taken from math.random(0, 255). Not suitable for cryptographic usage.
— Function lib.random_bytes length
— Function lib.randomseed seed
Initialize Snabb’s random number generation facility. If seed is nil, then the Lua math.random() function will be seeded from /dev/urandom, and lib.random_bytes will be initialized to lib.random_bytes_from_dev_urandom. This is Snabb’s default mode of operation.
Sometimes it’s useful to make Snabb use deterministic random numbers. In that case, pass a seed to lib.randomseed; Snabb will set lib.random_bytes to lib.random_bytes_from_math_random, and also print out a message to stderr indicating that we are using lower-quality deterministic random numbers.
As part of its initialization process, Snabb will call lib.randomseed with the value of the SNABB_RANDOM_SEED environment variable (if any). Set this environment variable to enable deterministic random numbers.
— Function lib.bounds_checked type, base, offset, size
Returns a table that acts as a bounds checked wrapper around a C array of type and size starting at base plus offset. Type must be a ctype and the caller must ensure that the allocated memory region at base/offset is at least sizeof(type)*size bytes long.
— Function lib.throttle seconds
Return a closure that returns true at most once during any seconds (a floating point value) time interval, otherwise false.
— Function lib.timeout seconds
Returns a closure that returns true if seconds (a floating point value) have elapsed since it was created, otherwise false.
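The difference between the two closures is easier to see with a fake clock. The sketch below models the described behaviour and is not Snabb's code: the second parameter now is a hypothetical addition that lets the example control time, whereas the documented functions take only seconds.

```lua
-- Model of lib.timeout: true once `seconds` have passed since creation.
local function timeout (seconds, now)
   local deadline = now() + seconds
   return function () return now() >= deadline end
end

-- Model of lib.throttle: true at most once per `seconds` interval.
local function throttle (seconds, now)
   local deadline = now()
   return function ()
      if now() >= deadline then
         deadline = now() + seconds
         return true
      end
      return false
   end
end

-- Fake clock controlled by the example.
local fake_time = 0
local function clock () return fake_time end

local expired = timeout(2, clock)
local tick = throttle(1, clock)

local first, second = tick(), nil
second = tick()
print(first, second, expired())   -- true, false, false
fake_time = 2.5
print(tick(), expired())          -- true, true
```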
— Function lib.waitfor condition
Blocks until the function condition returns a true value.
— Function lib.waitfor2 name, condition, attempts, interval
Repeatedly calls the function condition at intervals of interval milliseconds. If condition returns a true value waitfor2 returns. If condition does not return a true value after attempts calls, waitfor2 raises an error identified by name.
— Function lib.yesno flag
Returns the string "yes" if flag is a true value and "no" otherwise.
— Function lib.align value, size
Return the next integer that is a multiple of size starting from value.
— Function lib.csum pointer, length
Computes and returns the “IP checksum” of length bytes starting at pointer.
— Function lib.update_csum pointer, length, checksum
Returns checksum updated by length bytes starting at pointer. The default of checksum is 0LL.
— Function lib.finish_csum checksum
Returns the finalized checksum.
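The three checksum functions follow the usual split of the Internet checksum (RFC 1071) into an accumulate step and a finalize step. The sketch below shows that split over a Lua string using plain arithmetic. The function names mirror the documented ones, but these are stand-ins that take strings rather than pointers, and details such as the byte order of Snabb's return value are not modelled.

```lua
-- Add the 16-bit big-endian words of data to the running sum.
local function update_csum (data, sum)
   sum = sum or 0
   for i = 1, #data - 1, 2 do
      local hi, lo = data:byte(i, i + 1)
      sum = sum + hi * 256 + lo
   end
   if #data % 2 == 1 then
      -- An odd trailing byte is padded with a zero byte on the right.
      sum = sum + data:byte(#data) * 256
   end
   return sum
end

-- Fold the carries back into 16 bits and return the one's complement.
local function finish_csum (sum)
   while sum > 0xffff do
      sum = (sum % 0x10000) + math.floor(sum / 0x10000)
   end
   return 0xffff - sum
end

local function csum (data)
   return finish_csum(update_csum(data, 0))
end

-- Example bytes from RFC 1071: 00 01 f2 03 f4 f5 f6 f7
local bytes = string.char(0x00, 0x01, 0xf2, 0x03, 0xf4, 0xf5, 0xf6, 0xf7)
print(("%04x"):format(csum(bytes)))   -- 220d
```

Because the sum is only folded at the end, a buffer can be processed in several update_csum calls (split at even offsets) before finish_csum is applied.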
— Function lib.malloc etype
Returns a pointer to newly allocated DMA memory for etype.
— Function lib.deepcopy object
Returns a copy of object. Supports tables as well as ctypes.
— Function lib.array_copy array
Returns a copy of array. Array must not be a “sparse array”.
— Function lib.htonl n
— Function lib.htons n
Host to network byte order conversion functions for 32 and 16 bit integers n respectively. Unsigned.
— Function lib.ntohl n
— Function lib.ntohs n
Network to host byte order conversion functions for 32 and 16 bit integers n respectively. Unsigned.
— Function lib.random_bytes count
Return a fresh array of count random bytes. Suitable for cryptographic usage.
— Function lib.parse arg, config
Validates arg against the specification in config, and returns a fresh table containing the parameters in arg and any omitted optional parameters with their default values. Given arg, a table of parameters or nil, assert that all of the required keys from config are present, fill in any missing values for optional keys, and error if any unknown keys are found. Config has the following format:
config := { key = {[required=boolean], [default=value]}, ... }
Each key is optional unless required is set to a true value, and its default value defaults to nil.
Example:
lib.parse({foo=42, bar=43}, {foo={required=true}, bar={}, baz={default=44}})
=> {foo=42, bar=43, baz=44}
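The validation rules can be modelled in a few lines. The function below is a sketch of the behaviour described above, written for illustration; it is not Snabb's implementation and the error messages are made up.

```lua
-- Model of the documented rules: unknown keys are rejected, required keys
-- must be present, and optional keys fall back to their default.
local function parse (arg, config)
   arg = arg or {}
   local ret = {}
   for k in pairs(arg) do
      assert(config[k], "unrecognized parameter: " .. tostring(k))
   end
   for k, spec in pairs(config) do
      if arg[k] ~= nil then
         ret[k] = arg[k]
      elseif spec.required then
         error("missing required parameter: " .. tostring(k))
      else
         ret[k] = spec.default   -- nil when no default is given
      end
   end
   return ret
end

local p = parse({foo=42, bar=43},
                {foo={required=true}, bar={}, baz={default=44}})
print(p.foo, p.bar, p.baz)   -- 42, 43, 44

-- A missing required key and an unknown key both raise an error,
-- so pcall reports failure in each case.
local ok_missing = pcall(parse, {bar=1}, {foo={required=true}, bar={}})
local ok_unknown = pcall(parse, {foo=1, qux=2}, {foo={required=true}})
print(ok_missing, ok_unknown)   -- false, false
```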
— Function lib.set vargs
Reads a variable number of arguments and returns a table representing a set. The returned value can be used to query whether or not an element belongs to the set.
Example:
local t = set('foo', 'bar')
t['foo'] -- yields true.
t['quax'] -- yields false.
Snabb can operate as a group of cooperating processes. The main process is the initial one that you start directly. The optional worker processes are children spawned when the main process calls the core.worker module.
Each worker is a complete Snabb process. They can define app networks, run the engine, and do everything else that ordinary Snabb processes do. The exact behavior of each worker is determined by a Lua expression provided upon creation.
Groups of Snabb processes each have the following special properties:
kill -9.
DMA memory pointers obtained with memory.dma_alloc() are usable by all processes in the group. This means that you can share DMA memory pointers between processes, for example via shm shared memory objects, and reference them from any process. (The memory is automatically mapped at the expected address via a SEGV signal handler.)
The core.worker API functions are available in the main process only:
— Function worker.start name, luacode
Start a named worker process. The worker starts with a completely fresh Snabb process image (fork()+execve()) and then executes the string luacode as a Lua source code expression.
Example:
worker.start("myworker", [[
   print("hello world, from a Snabb worker process!")
   print("could configure and run the engine now...")
]])
— Function worker.stop name
Stop a named worker process. The worker is abruptly killed.
Example:
worker.stop("myworker")
— Function worker.status
Return a table summarizing the status of all workers. The table key is the worker name and the value is a table with pid and alive attributes.
Example:
for w, s in pairs(worker.status()) do
   print((" worker %s: pid=%s alive=%s"):format(
         w, s.pid, s.alive))
end
Output:
worker w3: pid=21949 alive=true
worker w1: pid=21947 alive=true
worker w2: pid=21948 alive=true
Snabb designs can be run either with:
snabb <snabb-arg>* <design> <design-arg>*
or
#!/usr/bin/env snabb <snabb-arg>*
...
The main module provides an interface for running Snabb scripts. It exposes various operating system functions to scripts.
— Field main.parameters
A list of command-line arguments to the running script. Read-only.
— Function main.exit status
Cleanly exits the process with status.
The module apps.basic.basic_apps provides apps with general functionality for use in your app networks.
The Source app is a synthetic packet generator. On each breath it fills each attached output link with new packets. It accepts a number as its configuration argument which is the byte size of the generated packets. By default, each packet is 60 bytes long. The packet data is initialized with zero bytes.
The Join app joins together packets from N input links onto one output link. On each breath it outputs as many packets as possible from the inputs onto the output.
The Split app splits packets from multiple inputs across multiple outputs. On each breath it transfers as many packets as possible from the input links to the output links.
The Sink app receives all packets from any number of input links and discards them. This can be handy in combination with a Source.
The Tee app receives all packets from any number of input links and transfers each received packet to all output links. It can be used to merge and/or duplicate packet streams.
The Repeater app collects all packets received from the input link and repeatedly transfers the accumulated packets to the output link. The packets are transmitted in the order they were received.
The Truncate app sends all packets received from the input to the output link and truncates or zero pads each packet to a given length. It accepts a number as its configuration argument which is the length of the truncated or padded packets.
The Sample app forwards every nth packet from the input link to the output link, and drops all other packets. It accepts a number as its configuration argument which is n.
The Intel82599 app drives one port of an Intel 82599 Ethernet controller. Packets taken from the rx port are transmitted onto the network. Packets received from the network are put on the tx port.
— Method Intel82599.dev:get_rxstats
Returns a table with the following keys:
counter_id - Counter id
packets - Number of packets received
dropped - Number of packets dropped
bytes - Total bytes received
— Method Intel82599.dev:get_txstats
Returns a table with the following keys:
counter_id - Counter id
packets - Number of packets sent
bytes - Total bytes sent
The Intel82599 app accepts a table as its configuration argument. The following keys are defined:
— Key pciaddr
Required. The PCI address of the NIC as a string.
— Key macaddr
Optional. The MAC address to use as a string. The default is a wild-card (i.e. accept all packets).
— Key vlan
Optional. A twelve bit integer (0-4095). If set, incoming packets from other VLANs are dropped and outgoing packets are tagged with a VLAN header.
— Key vmdq
Optional. Boolean, defaults to false. Enables interface virtualization, which allows multiple Intel82599 apps per port. If enabled, macaddr must be specified.
— Key mirror
Optional. A table. If set, this app will receive copies of all selected packets on the physical port. The selection is configured by setting keys of the mirror table. Either mirror.pool or mirror.port may be set.
If mirror.pool is true all pools defined on this physical port are mirrored. If mirror.pool is an array of pool numbers then the specified pools are mirrored.
If mirror.port is one of “in”, “out” or “inout” all incoming and/or outgoing packets on the port are mirrored respectively. Note that this does not include internal traffic which does not enter or exit through the physical port.
— Key rxcounter
— Key txcounter
Optional. Four bit integers (0-15). If set, incoming/outgoing packets will be counted in the selected statistics counter respectively. Multiple apps can share a counter. To retrieve counter statistics use Intel82599.dev:get_rxstats()
and Intel82599.dev:get_txstats()
.
— Key rate_limit
Optional. Number. Limits the maximum Mbit/s to transmit. Default is 0 which means no limit. Only applies to outgoing traffic.
— Key priority
Optional. Floating point number. Weight for the round-robin algorithm used to arbitrate transmission when rate_limit is not set or adds up to more than the line rate of the physical port. Default is 1.0 (scaled to the geometric middle of the scale which goes from 1/128 to 128). The absolute value is not relevant, instead only the ratio between competing apps controls their respective bandwidths. Only applies to outgoing traffic.
For example, if two apps without rate_limit set have the same priority, both get the same output bandwidth. If the priorities are 3.0/1.0, the output bandwidth is split 75%/25%. Likewise, 1.0/0.333 or 1.5/0.5 yield the same result.
Note that even a low-priority app can use the whole line rate unless other (higher priority) apps are using up the available bandwidth.
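The priority arithmetic above can be checked with a small plain-Lua calculation. This is a sketch under the assumption that all competing apps have traffic queued and none has a rate_limit; the function name bandwidth_shares is hypothetical and not part of the driver.

```lua
-- Given the priorities of competing apps, return each app's share of the
-- output bandwidth as a fraction between 0 and 1.
local function bandwidth_shares(priorities)
   local total = 0
   for _, p in ipairs(priorities) do
      total = total + p
   end
   local shares = {}
   for i, p in ipairs(priorities) do
      shares[i] = p / total
   end
   return shares
end

local a = bandwidth_shares({3.0, 1.0})   -- 75% and 25%
local b = bandwidth_shares({1.5, 0.5})   -- same ratio, same split
local c = bandwidth_shares({1.0, 1.0})   -- equal priorities, equal shares
```

Only the ratio matters: scaling every priority by the same factor leaves the shares unchanged.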
The Intel82599
app can transmit and receive at approximately 10 Mpps per processor core.
Each physical Intel 82599 port supports only a limited number of:
Intel82599 app instances
MAC addresses (see the macaddr configuration option)
VLANs (see the vlan configuration option)
mirrors (see the mirror configuration option)
LoadGen
is a load generator app based on the Intel 82599 Ethernet controller. It reads up to 32,000 packets from the input
port and transmits them repeatedly onto the network. All incoming packets are dropped.
The LoadGen
app accepts a string as its configuration argument. The given string denotes the PCI address of the NIC to use.
The LoadGen
app can transmit at line-rate (14 Mpps) without significant CPU usage.
The intel_mp.Intel
app provides drivers for Intel i210/i350/82599 based network cards. The driver exposes multiple receive and transmit queues that can be attached to separate instances of the app on different processes.
The links are named input
and output
.
If attaching multiple processes to a single NIC, performance appears better with engine.busywait = false
.
The intel_mp.Intel
app can drive an Intel 82599 NIC at 14 million pps.
— Method Intel:get_rxstats
Returns a table with the following keys:
counter_id - Counter id
packets - Number of packets received
dropped - Number of packets dropped
bytes - Total bytes received
— Method Intel:get_txstats
Returns a table with the following keys:
counter_id - Counter id
packets - Number of packets sent
bytes - Total bytes sent
— Key pciaddr
Required. The PCI address of the NIC as a string.
— Key ndesc
Optional. Number of DMA descriptors to use, i.e. the size of the DMA transmit and receive queues. Must be a multiple of 128. No default is specified here; the built-in default is assumed to be broadly applicable.
— Key rxq
Optional. The receive queue to attach to, numbered from 0. The default is 0. When VMDq is enabled, this number is used to index a queue (0 or 1) for the selected pool. Passing false
will disable the receive queue.
— Key txq
Optional. The transmit queue to attach to, numbered from 0. The default is 0. Passing false
will disable the transmit queue.
— Key vmdq
Optional. A boolean parameter that specifies whether VMDq (Virtual Machine Device Queues) is enabled. When VMDq is enabled, each instance of the driver is associated with a pool that can be assigned a MAC address or VLAN tag. Packets are delivered to pools that match the corresponding MACs or VLAN tags. Each pool may be associated with several receive and transmit queues.
For a given NIC, all driver instances should have this parameter either enabled or disabled uniformly. If this is enabled, macaddr must be specified.
— Key vmdq_queueing_mode
Optional. Sets the queueing mode to use in VMDq mode. Has no effect when VMDq is disabled. The available queueing modes for the 82599 are "rss-64-2"
(the default with 64 pools, 2 queues each) and "rss-32-4"
(32 pools, 4 queues each). The i350 provides only a single mode (8 pools, 2 queues each) and hence ignores this option.
— Key poolnum
Optional. The VMDq pool to associate with, numbered from 0. The default is to select a pool number automatically. The maximum pool number depends on the queueing mode.
— Key macaddr
Optional. The MAC address to use as a string. The default is a wild-card (i.e., accept all packets).
— Key vlan
Optional. A twelve-bit integer (0-4095). If set, incoming packets from other VLANs are dropped and outgoing packets are tagged with a VLAN header.
— Key mirror
Optional. A table. If set, this app will receive copies of all selected packets on the physical port. The selection is configured by setting keys of the mirror table. Either mirror.pool or mirror.port may be set.
If mirror.pool is true
all pools defined on this physical port are mirrored. If mirror.pool is an array of pool numbers then the specified pools are mirrored.
If mirror.port is one of “in”, “out” or “inout” all incoming and/or outgoing packets on the port are mirrored respectively. Note that this does not include internal traffic which does not enter or exit through the physical port.
— Key rxcounter
— Key txcounter
Optional. Four bit integers (0-15). If set, incoming/outgoing packets will be counted in the selected statistics counter respectively. Multiple apps can share a counter. To retrieve counter statistics use Intel:get_rxstats()
and Intel:get_txstats()
.
— Key rate_limit
Optional. Number. Limits the maximum Mbit/s to transmit. Default is 0 which means no limit. Only applies to outgoing traffic.
— Key priority
Optional. Floating point number. Weight for the round-robin algorithm used to arbitrate transmission when rate_limit is not set or adds up to more than the line rate of the physical port. Default is 1.0 (scaled to the geometric middle of the scale which goes from 1/128 to 128). The absolute value is not relevant, instead only the ratio between competing apps controls their respective bandwidths. Only applies to outgoing traffic.
For example, if two apps without rate_limit set have the same priority, both get the same output bandwidth. If the priorities are 3.0/1.0, the output bandwidth is split 75%/25%. Likewise, 1.0/0.333 or 1.5/0.5 yield the same result.
Note that even a low-priority app can use the whole line rate unless other (higher priority) apps are using up the available bandwidth.
— Key rsskey
Optional. A 32-bit integer that seeds the hash used to distribute packets across queues. If the packet flow passes through multiple levels of RSS Snabb devices, giving each level a unique rsskey helps packet distribution.
— Key wait_for_link
Optional. Boolean that indicates if new
should block until there is a link light or not. The default is false
.
— Key linkup_wait
Optional. Number of seconds new
waits for the device to come up. The default is 120.
— Key linkup_wait_recheck
Optional. If the linkup_wait
option is set, the number of seconds to sleep between checks of the link state. The default is 0.1 seconds.
— Key mtu
Optional. The maximum packet length sent or received, excluding the trailing 4-byte CRC. The default is 9014.
— Key master_stats
Optional. Boolean indicating whether to elect an arbitrary app (the master) to collect device statistics. The default is true.
— Key run_stats
Optional. Boolean indicating if this app instance should collect device statistics. One per physical NIC (conflicts with master_stats
). There is a small but detectable run time performance hit incurred. The default is false.
— Key mac_loopback
Optional. Boolean indicating if the card should operate in “Tx->Rx MAC Loopback mode” for diagnostics or testing purposes. If this is true then wait_for_link
is implicitly false. The default is false.
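As an illustration of how the keys above combine, the following sketch builds configuration tables for two intel_mp.Intel instances that share one NIC in VMDq mode. Only the key names come from the list above; the PCI address, MAC addresses, pool numbers and the helper name pool_conf are hypothetical placeholders.

```lua
-- Hypothetical PCI address of the shared NIC.
local pci = "0000:01:00.0"

-- Build the configuration table for one VMDq pool on the shared NIC.
-- vmdq is enabled uniformly for every instance, and macaddr is given
-- because it must be specified when vmdq is enabled.
local function pool_conf(poolnum, mac)
   return {
      pciaddr = pci,
      vmdq = true,
      vmdq_queueing_mode = "rss-64-2",  -- 64 pools, 2 queues each
      poolnum = poolnum,
      macaddr = mac,
      rxq = 0,
      txq = 0,
   }
end

local conf_a = pool_conf(0, "02:00:00:00:00:01")
local conf_b = pool_conf(1, "02:00:00:00:00:02")
```

Each table would then be passed as the configuration argument of one app instance.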
RSS will distribute packets based on as many of the fields below as are present in the packet:
Packets that are not IPv4 or IPv6 will be delivered to receive queue 0.
Each chipset supports a differing number of receive / transmit queues:
The Intel82599 supports both VMDq and RSS with 32/64 pools and 4/2 RSS queues for each pool. Intel1g i350 supports both VMDq and RSS with 8 pools and 2 queues for each pool. Intel1g i210 does not support VMDq.
The Solarflare
app drives one port of a Solarflare SFN7 Ethernet controller. Multiple instances of the Solarflare app can be instantiated on the same PCI device. Packets received from the network will be dispatched between apps based on destination MAC address and VLAN. Packets taken from the rx
port are transmitted onto the network. Packets received from the network are put on the tx
port.
The Solarflare
app requires OpenOnload version 201502 to be installed and the sfc
module to be loaded.
The Solarflare
app accepts a table as its configuration argument. The following keys are defined:
— Key pciaddr
Required. The PCI address of the NIC as a string.
— Key macaddr
Optional. The MAC address to use as a string. The default is a wild-card (i.e. accept all packets).
— Key vlan
Optional. A twelve bit integer (0-4095). If set, incoming packets from other VLANs are dropped and outgoing packets are tagged with a VLAN header.
The RateLimiter
app implements a token bucket algorithm with a single bucket and drops non-conforming packets. It receives packets on the input
port and transmits conforming packets to the output
port.
— Method RateLimiter:snapshot
Returns throughput statistics in the form of a table with the following fields:
rx - Number of packets received
tx - Number of packets transmitted
time - Current time in nanoseconds
The RateLimiter
app accepts a table as its configuration argument. The following keys are defined:
— Key rate
Required. Rate in bytes per second to which throughput should be limited.
— Key bucket_capacity
Required. Bucket capacity in bytes. Should be equal to or greater than rate. Otherwise the effective rate may be limited.
— Key initial_capacity
Optional. Initial bucket capacity in bytes. Defaults to bucket_capacity.
The RateLimiter
app is able to process more than 20 Mpps per CPU core. Refer to its selftest for details.
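For readers unfamiliar with the technique, a single token bucket can be sketched in plain Lua as follows. This is a generic illustration of the algorithm using the three configuration keys above, not the RateLimiter source; the function names and the use of seconds as the time unit are assumptions.

```lua
-- Create a bucket that refills at `rate` bytes per second up to `capacity`
-- bytes. `initial` defaults to a full bucket, like initial_capacity.
local function new_bucket(rate, capacity, initial)
   return { rate = rate, capacity = capacity,
            tokens = initial or capacity, last = 0 }
end

-- Return true if a packet of `length` bytes conforms at time `now`
-- (in seconds). Conforming packets consume tokens; others are dropped.
local function conforms(bucket, length, now)
   local refill = (now - bucket.last) * bucket.rate
   bucket.tokens = math.min(bucket.capacity, bucket.tokens + refill)
   bucket.last = now
   if length <= bucket.tokens then
      bucket.tokens = bucket.tokens - length
      return true
   end
   return false
end

local bucket = new_bucket(1000, 1500)
local first = conforms(bucket, 1500, 0)     -- bucket starts full
local second = conforms(bucket, 1500, 0.5)  -- only 500 bytes refilled so far
local third = conforms(bucket, 1500, 2.0)   -- refilled to capacity again
```

The sketch also shows why a bucket_capacity smaller than the largest packet would block that packet forever: the bucket can never hold enough tokens for it.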
The PcapFilter
app receives packets on the input
port and transmits conforming packets to the output
port. In order to conform, a packet must match the pcap-filter expression of the PcapFilter
instance and/or belong to a sanctioned connection. For a connection to be sanctioned it must be tracked in a state table by a PcapFilter
app using the same state table. All PcapFilter
apps share a global namespace of state table identifiers. Multiple PcapFilter
apps (e.g. for inbound and outbound traffic) can refer to the same connection by sharing a state table identifier.
The PcapFilter
app accepts a table as its configuration argument. The following keys are available:
— Key filter
Required. A string containing a pcap-filter expression.
— Key state_table
Optional. A string naming a state table. If set, packets passing any rule will be tracked in the specified state table, and any packet that belongs to a tracked connection in the specified state table will be allowed to pass.
— Key sessions_established
Total number of sessions established.
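A possible pair of PcapFilter configurations that share one state table is sketched below. The filter expressions and the state table name "web" are hypothetical examples; only the key names filter and state_table come from the description above.

```lua
-- Outbound direction: allow web traffic to leave and track its connections.
local outbound_conf = {
   filter = "tcp dst port 80",
   state_table = "web",
}

-- Inbound direction: accept ICMP, plus any packet that belongs to a
-- connection already tracked in the shared "web" state table.
local inbound_conf = {
   filter = "icmp",
   state_table = "web",
}
```

Because both tables name the same state table, replies to tracked outbound connections can pass the inbound filter even though they do not match its expression.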
The ARP
app implements the Address Resolution Protocol, allowing a Snabb network function to automatically learn the next-hop MAC address for outgoing IPv4 traffic. The ARP
app will also respond to incoming address resolution requests from other hosts on the same network. The next-hop MAC address may also be statically configured. Finally, the Ethernet source address for all outgoing traffic will be set to the self_mac
address configured on the ARP
app.
All of this together means that using the ARP
app in your network function allows you to forget about link-layer concerns, for IPv4 traffic anyway.
Topologically, the ARP
app sits between your network function and the Ethernet interface. Its north
link relays traffic to and from the network function; the south
link talks instead to the Ethernet interface.
The ARP
app accepts a table as its configuration argument. The following keys are defined:
— Key self_mac
Optional. The MAC address of this network function. If not provided, a random MAC address will be generated. Two random MAC addresses have a one-in-nine-million chance of colliding. The ARP app will ensure that all outgoing southbound traffic will originate from this MAC address.
— Key self_ip
Required. The IPv4 address of this host; used to respond to requests and when making ARP requests.
— Key next_mac
Optional. The MAC address to which to send all network traffic. This ARP app currently has the limitation that it assumes that all traffic will go to a single MAC address. If this address is provided as part of the configuration, no ARP request will be made; otherwise it will be determined from the next_ip via ARP.
— Key next_ip
Optional. The IPv4 address of the next-hop host. Required only if next_mac is not specified as part of the configuration.
— Key shared_next_mac_key
Optional. Path to a shared memory location (i.e. /var/run/snabb/PID/PATH) in which to store the resolved next_mac. This ARP resolver might be part of a set of peer processes sharing work via RSS. In that case, an ARP response will probably arrive only to one of the RSS processes, not to all of them. If you are using ARP behind RSS, set shared_next_mac_key to, for example, group/arp-next-mac
, to enable the different workers to communicate the next-hop MAC address.
The Reassembler
app is a filter on incoming IPv4 packets that reassembles fragments. Note that Snabb’s internal MTU is 10240 bytes; attempts to reassemble larger packets will fail.
The reassembler has a configurable limit for the reassembly buffer size. If the buffer is full and a new reassembly comes in on the input, the reassembler app will randomly evict a pending reassembly from its buffer before starting the new reassembly.
The reassembler app currently does not time out reassemblies that have been around for too long. It could be a good idea to implement timeouts and then be able to issue “timeout exceeded” ICMP errors if needed.
Finally, note that the reassembler app will pass through any incoming packet that is not IPv4.
The Reassembler
app accepts a table as its configuration argument. The following keys are defined:
— Key max_concurrent_reassemblies
Optional. The maximum number of concurrent reassemblies. Note that each reassembly uses about 11kB of memory. The default is 20000.
— Key max_fragments_per_reassembly
Optional. The maximum number of fragments per reassembly. The default is 40.
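The memory implied by max_concurrent_reassemblies can be estimated with simple arithmetic. The sketch below takes "about 11kB" as 11,000 bytes, which is an approximation, and the function name is hypothetical.

```lua
-- Approximate per-reassembly memory use quoted above ("about 11kB").
local bytes_per_reassembly = 11 * 1000

-- Estimate the worst-case size of the reassembly buffer in bytes.
local function reassembly_buffer_bytes(max_concurrent_reassemblies)
   return max_concurrent_reassemblies * bytes_per_reassembly
end

local default_bytes = reassembly_buffer_bytes(20000)  -- roughly 220 MB
local small_bytes = reassembly_buffer_bytes(1000)     -- roughly 11 MB
```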
The Fragmenter
app fragments any IPv4 packets larger than a configured maximum transmission unit (MTU).
The Fragmenter
app accepts a table as its configuration argument. The following key is defined:
— Key mtu
Required. The maximum transmission unit, in bytes, not including the Ethernet header.
The ICMPEcho
app responds to ICMP echo requests (“pings”) to a given set of IPv4 addresses.
Like the ARP
app, ICMPEcho
sits between your network function and outside traffic. Its north
link relays traffic to and from the network function; the south
link talks to the world.
The ICMPEcho
app accepts a table as its configuration argument. The following keys are defined:
— Key address
Optional. An IPv4 address for which to respond to pings, as a uint8_t[4]
.
— Key addresses
Optional. An array of IPv4 addresses for which to respond to pings, as a Lua array of uint8_t[4]
values.
The nd_light
app implements a small subset of IPv6 neighbor discovery (RFC4861). It has two duplex ports, north
and south
. The south
port attaches to a network on which neighbor discovery (ND) must be performed. The north
port attaches to an app that processes IPv6 packets (including full ethernet frames). Packets transmitted to the north
port must be wrapped in full Ethernet frames (which may be empty).
The nd_light
app replies to neighbor solicitations for which it is configured as a target and performs rudimentary address resolution for its configured next-hop address. If address resolution succeeds, the Ethernet headers of packets from the north
port will be overwritten with headers containing the discovered destination address and the configured source address before they are transmitted over the south
port. All packets from the north
port are discarded as long as ND has not yet succeeded. Packets received from the south
port are transmitted to the north
port unaltered.
The nd_light
app accepts a table as its configuration argument. The following keys are defined:
— Key local_mac
Required. Local MAC address as a string or in binary representation.
— Key remote_mac
Optional. MAC address of next_hop address as a string or in binary representation. If this option is present, the nd_light
app does not perform neighbor solicitation for the next_hop address and uses remote_mac as the MAC address associated with next_hop.
— Key local_ip
Required. Local IPv6 address as a string or in binary representation.
— Key next_hop
Required. IPv6 address of next hop as a string or in binary representation.
— Key delay
Optional. Neighbor solicitation retransmission delay in milliseconds. Default is 1,000ms.
— Key retrans
Optional. Number of neighbor solicitation retransmissions. Default is unlimited retransmissions.
— Key quiet
Optional. If set to true, suppress log messages about ND activity. Default is false.
— Key ns_checksum_errors
Neighbor solicitation requests dropped due to invalid ICMP checksum.
— Key ns_target_address_errors
Neighbor solicitation requests dropped due to invalid target address.
— Key na_duplicate_errors
Neighbor advertisement requests dropped because next-hop is already resolved.
— Key na_target_address_errors
Neighbor advertisement requests dropped due to invalid target address.
— Key nd_protocol_errors
Neighbor discovery requests dropped due to protocol errors (invalid IPv6 hop-limit or invalid neighbor solicitation request options).
The SimpleKeyedTunnel
app implements “a simple L2 Ethernet over IPv6 tunnel encapsulation” as described in Keyed IPv6 Tunnel. It has two duplex ports, encapsulated
and decapsulated
. Packets transmitted on the decapsulated
input port will be encapsulated and put on the encapsulated
output port. Packets transmitted on the encapsulated
input port will be decapsulated and put on the decapsulated
output port.
The SimpleKeyedTunnel
app accepts a table as its configuration argument. The following keys are defined:
— Key local_address
Required. Local IPv6 address as a string.
— Key remote_address
Required. Remote IPv6 address as a string.
— Key local_cookie
Required. Local cookie, 8 bytes encoded in a hexadecimal string.
— Key remote_cookie
Required. Remote cookie, 8 bytes encoded in a hexadecimal string.
— Key local_session
Optional. Unsigned integer, 32 bit. If set, the session_id
field of the L2TPv3 header will be overwritten with this value.
— Key hop_limit
Optional. Unsigned integer. Sets the hop limit. Default is 64.
— Key default_gateway_MAC
Optional. Destination MAC as a string. Not required if overwritten by an app such as nd_light
.
— Key length_errors
Ingress packets dropped due to invalid length (packet too short).
— Key protocol_errors
Ingress packets dropped due to unrecognized IPv6 protocol ID.
— Key cookie_errors
Ingress packets dropped due to wrong cookie value.
— Key remote_address_errors
Ingress packets dropped due to wrong remote IPv6 endpoint address.
— Key local_address_errors
Ingress packets dropped due to wrong local IPv6 endpoint address.
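A configuration table for one tunnel endpoint might look like the sketch below. All values are hypothetical placeholders (the addresses use the 2001:db8::/32 documentation prefix), and the is_cookie helper is not part of the app. It assumes a cookie is written as 16 hexadecimal digits without separators, which is one reading of "8 bytes encoded in a hexadecimal string".

```lua
-- Check that `s` looks like 8 bytes written as 16 hexadecimal digits.
-- Assumption: no separators or "0x" prefix are used.
local function is_cookie(s)
   return type(s) == "string" and #s == 16 and s:match("^%x+$") ~= nil
end

-- Hypothetical endpoint configuration using the keys listed above.
local tunnel_conf = {
   local_address = "2001:db8::1",
   remote_address = "2001:db8::2",
   local_cookie = "0011223344556677",
   remote_cookie = "8899aabbccddeeff",
   hop_limit = 64,
}

local cookies_ok = is_cookie(tunnel_conf.local_cookie)
   and is_cookie(tunnel_conf.remote_cookie)
```

The peer endpoint would use the mirrored values: its local_cookie is this endpoint's remote_cookie and vice versa.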
The Fragmenter
app will fragment any IPv6 packets larger than a configured maximum transmission unit (MTU) or the dynamically discovered MTU on the network path (PMTU) towards a specific destination, depending on the setting of the pmtud configuration option.
If path MTU discovery (PMTUD) is disabled, the app expects to receive packets on its input
link and sends (possibly fragmented) packets to its output
link.
If PMTUD is enabled, the app also expects to process packets in the reverse direction in order to be able to intercept and interpret ICMP packets of type 2, code 0. Those packets, known as “Packet Too Big” (PTB) messages, contain reports from nodes on the path towards a particular destination, which indicate that a previously sent packet could not be forwarded due to an MTU bottleneck. The message contains the MTU in question as well as at least the header of the original packet that triggered the PTB message. The Fragmenter
app extracts the destination address from the original packet and stores the MTU in a per-destination cache as the PMTU for that address.
Apart from checking the integrity of the ICMP message, the app can optionally also verify whether the message is actually intended for consumption by this instance of the Fragmenter
app. For that purpose, the app can be configured with an exhaustive list of IPv6 addresses that are designated to be local to the system. When a PTB message is received, it is checked whether the destination address of the ICMP message as well as the source address of the embedded original packet are contained in this list. The message is discarded if this condition is not met. No such checking is performed if the list is empty.
When the Fragmenter
receives a packet on the input
link, it first consults the per-destination cache. In case of a hit, the PMTU from the cache takes precedence over the statically configured MTU.
A PMTU is removed from the cache after a configurable timeout to allow the system to discover a larger PMTU, e.g. after a change in network topology.
With PMTUD enabled, the app has two additional links, called north
and south
. All packets received on the south
link which are not ICMP packets of type 2, code 0 are passed on unmodified on the north
link.
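The PMTU handling described above can be sketched in plain Lua. This is a simplified model, not the Fragmenter source: addresses are plain strings, time is passed in explicitly in seconds, and all function names are hypothetical.

```lua
-- Create an empty per-destination PMTU cache with the given timeout.
local function new_cache(timeout)
   return { timeout = timeout, entries = {} }
end

-- Decide whether a PTB message is meant for this instance. With an empty
-- list of local addresses no check is performed.
local function accept_ptb(local_addresses, icmp_dst, orig_src)
   if #local_addresses == 0 then return true end
   local is_local = {}
   for _, addr in ipairs(local_addresses) do is_local[addr] = true end
   return is_local[icmp_dst] == true and is_local[orig_src] == true
end

-- Store the MTU reported by a PTB message for destination `dst`.
local function record_ptb(cache, dst, mtu, now)
   cache.entries[dst] = { mtu = mtu, expires = now + cache.timeout }
end

-- Return the MTU to use for `dst`: a cached, unexpired PMTU takes
-- precedence over the statically configured MTU.
local function mtu_for(cache, dst, static_mtu, now)
   local entry = cache.entries[dst]
   if entry and now >= entry.expires then
      cache.entries[dst] = nil  -- expired, allow rediscovery
      entry = nil
   end
   return entry and entry.mtu or static_mtu
end

local cache = new_cache(600)
record_ptb(cache, "2001:db8::2", 1280, 0)
local hit = mtu_for(cache, "2001:db8::2", 1500, 10)
local miss = mtu_for(cache, "2001:db8::3", 1500, 10)
local expired = mtu_for(cache, "2001:db8::2", 1500, 600)
```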
The Fragmenter
app accepts a table as its configuration argument. The following keys are defined:
— Key mtu
Required. The maximum transmission unit, in bytes, not including the Ethernet header.
— Key pmtud
Optional. If set to true
, dynamic path MTU discovery (PMTUD) is enabled. The default is false
.
— Key pmtu_timeout
Optional. The amount of time in seconds after which a PMTU is removed from the cache. The default is 600. This key is ignored unless pmtud is true
.
— Key pmtu_local_addresses
Optional. A table of IPv6 addresses in human readable representation for which the app will accept PTB messages. The default is an empty table, which disables the check for local addresses.
The ICMPEcho
app responds to ICMP echo requests (“pings”) to a given set of IPv6 addresses.
Like the ARP
app, ICMPEcho
sits between your network function and outside traffic. Its north
link relays traffic to and from the network function; the south
link talks to the world.
The ICMPEcho
app accepts a table as its configuration argument. The following keys are defined:
— Key address
Optional. An IPv6 address for which to respond to pings, as a uint8_t[16]
.
— Key addresses
Optional. An array of IPv6 addresses for which to respond to pings, as a Lua array of uint8_t[16]
values.
The VhostUser
app implements portions of the Virtio protocol for virtual ethernet I/O interfaces. In particular, VhostUser
supports the virtio vring data structure for packet I/O in shared memory (DMA) and the Linux vhost API for creating vrings attached to tuntap devices.
With VhostUser
SnabbSwitch can be used as a virtual ethernet interface by QEMU virtual machines. When connected via a UNIX socket, packets can be sent to the virtual machine by transmitting them on the rx
port and packets sent by the virtual machine will arrive on the tx
port.
The VhostUser
app accepts a table as its configuration argument. The following keys are defined:
— Key socket_path
Optional. A string denoting the path to the UNIX socket to connect on. If it is not given, all incoming packets will be dropped.
— Key is_server
Optional. Listen and accept an incoming connection on socket_path instead of connecting to it.
The VirtioNet
app implements a subset of the driver part of the virtio-net specification. It can connect to a virtio-net device from within a QEMU virtual machine. Packets can be sent out of the virtual machine by transmitting them on the rx
port, and packets sent to the virtual machine will arrive on the tx
port.
The VirtioNet
app accepts a table as its configuration argument. The following keys are defined:
— Key pciaddr
Required. The PCI address of the virtio-net device.
— Key use_checksum
Optional. Boolean value to enable the checksum offloading pre-calculations applied on IPv4/IPv6 TCP and UDP packets.
The PcapReader
and PcapWriter
apps can be used to inject and log raw packet data into and out of the app network using the Libpcap File Format. PcapReader
reads raw packets from a PCAP file and transmits them on its output
port while PcapWriter
writes packets received on its input
port to a PCAP file.
Both PcapReader
and PcapWriter
expect a filename string as their configuration arguments to read from and write to respectively. PcapWriter
will alternatively accept an array as its configuration argument, with the first element being the filename and the second element being a mode argument to io.open
.
The Tap
app is a simple in-band packet tap that writes packets that it sees to a pcap savefile. It can optionally write only packets that pass a pcap filter, and optionally subsample so that it writes only every nth packet.
The Tap
app accepts a table as its configuration argument. The following keys are defined:
— Key filename
Required. The name of the file to which to write the packets.
— Key mode
Optional. Either "truncate"
or "append"
, indicating whether the savefile will be truncated (the default) or appended to.
— Key filter
Optional. A pflang filter expression to select packets for tapping. Only packets that pass this filter will be sampled for the packet tap.
— Key sample
Optional. A sampling period. Defaults to 1, indicating that every packet seen by the tap and passing the optional filter string will be written. Setting this value to 2 will capture every second packet, and so on.
The RawSocket
app is a bridge between Linux network interfaces (eth0
, lo
, etc.) and a Snabb app network. Packets taken from the rx
port are transmitted over the selected interface. Packets received on the interface are put on the tx
port.
The RawSocket
app accepts a string as its configuration argument. The string denotes the interface to bridge to.
The UnixSocket
app provides I/O for a named Unix socket.
The UnixSocket
app takes a string argument which denotes the Unix socket file name to open, or a table with the fields:
filename - the Unix socket file name to open.
listen - if true, listen for incoming connections on the socket rather than connecting to the socket in client mode.
mode - can be “stream” or “packet” (the default is “stream”): the difference is that in packet mode, the packets are not split or merged (in both modes packets arrive in order).
NOTE: The socket is not opened until the first call to push() or pull(). If connection is lost, the socket will be re-opened on the next call to push() or pull().
The Tap
app is used to interact with a Linux tap device. Packets transmitted on the input
port will be sent over the tap device, and packets that arrive on the tap device can be received on the output
port.
This app accepts either a single string or a table as its configuration argument. A single string is equivalent to the default configuration with the name
attribute set to the string.
— Key name
Required. The name of the tap device.
If the device does not exist yet, which is inferred from the absence of the directory /sys/class/net/
name, it will be created by the app and removed when the process terminates. Such a device is called ephemeral and its operational state is set to up after creation.
If the device already exists, it is called persistent. The app can attach to a persistent tap device and detaches from it when it terminates. The operational state is not changed. By default, the MTU is also not changed by the app, see the mtu_set option below.
One manner in which a persistent tap device can be created is by using the ip
tool:
ip tuntap add Tap345 mode tap
ip link set up dev Tap345
ip link set address 02:01:02:03:04:08 dev Tap345
— Key mtu
Optional. The L2 MTU of the device. The default is 1514.
By definition, the L2 MTU includes the size of the L2 header, e.g. 14 bytes in case of Ethernet without VLANs. However, the Linux ioctl
methods only expose the L3 (IP) MTU, which does not include the L2 header. The following configuration options are used to correct this discrepancy.
— Key mtu_fixup
Optional. A boolean that indicates whether the mtu option should be corrected for the difference between the L2 and L3 MTU. The default is true.
— Key mtu_offset
Optional. The value by which the mtu is reduced when mtu_fixup is set to true. The default is 14.
The resulting MTU is called the effective MTU.
— Key mtu_set
Optional. Either nil or a boolean that indicates whether the MTU of the tap device should be set or checked. If mtu_set is true, the MTU of the tap device is set to the effective MTU. If mtu_set is false, the effective MTU is compared with the current value of the MTU of the tap device and an error is raised in case of a mismatch.
If mtu_set is nil, the MTU is set or checked if the tap device is ephemeral or persistent, respectively. The rationale is that if the device is persistent, the entity that created the device is responsible for the configuration and might not expect or react well to a change of the MTU.
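The interaction of the mtu, mtu_fixup, mtu_offset and mtu_set options can be summarised in a short plain-Lua sketch. The function names are hypothetical and the code only mirrors the rules stated above; it does not touch any real tap device.

```lua
-- Compute the effective MTU from a configuration table, applying the
-- documented defaults (mtu 1514, mtu_fixup true, mtu_offset 14).
local function effective_mtu(conf)
   local mtu = conf.mtu or 1514
   local fixup = conf.mtu_fixup
   if fixup == nil then fixup = true end
   local offset = conf.mtu_offset or 14
   if fixup then
      return mtu - offset
   end
   return mtu
end

-- Decide whether the device MTU is set or merely checked.
-- `ephemeral` is true if the app created the tap device itself.
local function mtu_action(conf, ephemeral)
   if conf.mtu_set == nil then
      return ephemeral and "set" or "check"
   end
   return conf.mtu_set and "set" or "check"
end

local default_mtu = effective_mtu({})                              -- 1514 - 14
local raw_mtu = effective_mtu({ mtu = 9014, mtu_fixup = false })   -- unchanged
```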
There are three VLAN related apps, Tagger
, Untagger
and VlanMux
. The Tagger
and Untagger
apps add or remove a VLAN tag whereas the VlanMux
app can multiplex and demultiplex packets to different output ports based on tag.
The Tagger
app adds a VLAN tag, with the configured value and encapsulation, to packets received on its input
port and transmits them on its output
port.
— Key encapsulation
Optional. The Ethertype to use as encapsulation for the VLAN. Permitted values are the strings “dot1q” and “dot1ad” or a number to select an arbitrary Ethertype. “dot1q” and “dot1ad” correspond to the Ethertypes 0x8100 and 0x88a8, respectively, according to the IEEE standards 802.1Q and 802.1ad.
If a number is given, it is truncated to 16 bits. This feature is intended to allow interoperation with vendors that do not use one of the standard encapsulations (a prominent example being the value 0x9100, which can still be found in practice for double-tagging instead of 0x88a8).
The default is “dot1q”.
— Key tag
Required. VLAN tag to add or remove from the packet. The value must be a number in the range 1-4094 (inclusive).
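How the encapsulation and tag options map onto the 32 bits of a VLAN header can be sketched as follows. This is an illustration in plain Lua, not the Tagger source; the function names are hypothetical, and the priority and drop-eligible bits are assumed to be zero.

```lua
-- Ethertypes for the two named encapsulations.
local ethertypes = { dot1q = 0x8100, dot1ad = 0x88a8 }

-- Resolve the encapsulation option to a 16-bit Ethertype.
local function resolve_ethertype(encapsulation)
   encapsulation = encapsulation or "dot1q"    -- documented default
   if type(encapsulation) == "number" then
      return encapsulation % 0x10000           -- truncate to 16 bits
   end
   return assert(ethertypes[encapsulation], "unknown encapsulation")
end

-- Build the 32-bit tag: Ethertype in the upper 16 bits, VLAN ID in the
-- lower 12 bits. Assumption: priority and DEI bits are left at zero.
local function build_tag(encapsulation, vid)
   assert(vid >= 1 and vid <= 4094, "tag must be in the range 1-4094")
   return resolve_ethertype(encapsulation) * 0x10000 + vid
end

local q_tag = build_tag("dot1q", 10)
local ad_tag = build_tag("dot1ad", 4094)
local custom = resolve_ethertype(0x19100)  -- truncated to 0x9100
```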
The Untagger
app checks packets received on its input
port for a VLAN tag, removes it if it matches with the configured VLAN tag and transmits them on its output
port. Packets with other VLAN tags than the configured tag are dropped.
— Key encapsulation
Optional. See above.
— Key tag
Required. VLAN tag to add or remove from the packet. The value must be a number in the range 1-4094 (inclusive).
Despite the name, the VlanMux
app can act as a multiplexer, i.e. receive packets from multiple input ports, add a VLAN tag to each and transmit them on a single trunk port, as well as a demultiplexer, i.e. receive packets from its trunk
port and distribute them over many output ports based on the VLAN tag of each received packet. It supports the notion of a “native VLAN” by mapping untagged frames on the trunk port to a dedicated output port.
A packet received on its trunk
input port must either be untagged or tagged with the encapsulation as specified with the encapsulation configuration option. Otherwise, the packet is dropped.
If the Ethernet frame is tagged, the VLAN ID is extracted and the packet is transmitted on the port named vlan<vid>
, where <vid>
is the decimal representation of the VLAN ID. If no such port exists, the packet is dropped.
If the Ethernet frame is untagged, it is transmitted on the port named native
or dropped if no such port exists.
A packet received on a port named vlan<vid>
is tagged with the VLAN ID <vid>
according to the configured encapsulation and transmitted on the trunk port.
A packet received on the port named native
is transmitted as is on the trunk port.
— Key encapsulation
Optional. See above.
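The port naming rules for the trunk direction can be summarized in a few lines of Lua. This is a sketch of the documented behavior, not the VlanMux implementation; the helper `demux_port` and the `ports` set are hypothetical names introduced for illustration.

```lua
-- Hypothetical helper mirroring the demultiplexing rules above.
-- "vid" is the VLAN ID of a tagged frame, or nil for an untagged frame;
-- "ports" is the set of existing output port names.
local function demux_port (vid, ports)
   local name = vid and string.format("vlan%d", vid) or "native"
   if ports[name] then return name end
   return nil   -- no such port: the packet is dropped
end

local ports = { vlan10 = true, vlan20 = true, native = true }
print(demux_port(10, ports))    -- vlan10
print(demux_port(nil, ports))   -- native
print(demux_port(30, ports))    -- nil (packet dropped)
```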
A bridge
app implements a basic Ethernet bridge with split-horizon semantics. It has an arbitrary number of ports. For each input port there must exist an output port with the same name. Each port name is a member of at most one split-horizon group. If it is not a member of a split-horizon group, the port is also called a free port. Packets arriving on a free input port may be forwarded to all other output ports. Packets arriving on an input port that belongs to a split-horizon group are never forwarded to any output port belonging to the same split-horizon group. There are two bridge
implementations available: apps.bridge.flooding
and apps.bridge.learning
.
A bridge
app accepts a table as its configuration argument. The following keys are defined:
— Key ports
Optional. An array of free port names. The default is no free ports.
— Key split_horizon_groups
Optional. A table mapping split-horizon groups to arrays of port names. The default is no split-horizon groups.
— Key config
Optional. The configuration of the actual bridge implementation.
The flooding bridge
app implements the simplest possible bridge, which floods a packet arriving on an input port to all output ports within its scope according to the split-horizon topology.
The flooding bridge
app ignores the config key of its configuration.
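The following sketch shows how the flooding scope follows from the split-horizon rules, using a configuration table with the documented `ports` and `split_horizon_groups` keys. The functions `group_of_ports` and `flood_scope` are hypothetical and only model the rules; they are not taken from the bridge implementation.

```lua
-- Builds a map from port name to its split-horizon group
-- (free ports have no entry).
local function group_of_ports (conf)
   local group_of = {}
   for group, names in pairs(conf.split_horizon_groups or {}) do
      for _, name in ipairs(names) do group_of[name] = group end
   end
   return group_of
end

-- Returns the sorted list of output ports to which a packet arriving
-- on port "input" is flooded.
local function flood_scope (conf, input)
   local group_of = group_of_ports(conf)
   local all = {}
   for _, name in ipairs(conf.ports or {}) do all[#all+1] = name end
   for _, names in pairs(conf.split_horizon_groups or {}) do
      for _, name in ipairs(names) do all[#all+1] = name end
   end
   local scope = {}
   for _, name in ipairs(all) do
      local same_group = group_of[input] ~= nil
         and group_of[input] == group_of[name]
      if name ~= input and not same_group then
         scope[#scope+1] = name
      end
   end
   table.sort(scope)
   return scope
end

local conf = {
   ports = { "uplink" },
   split_horizon_groups = { core = { "pe1", "pe2" } },
}
print(table.concat(flood_scope(conf, "uplink"), " "))   -- pe1 pe2
print(table.concat(flood_scope(conf, "pe1"), " "))      -- uplink
```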
The learning bridge
app implements a learning bridge using a custom hash table to store the set of MAC source addresses of packets arriving on each input port. When a packet is received it is forwarded to all output ports whose corresponding input ports match the packet’s destination MAC address. When no input port matches, the packet is flooded to all output ports. Multicast MAC addresses are always flooded to all output ports associated with the input port. The scoping rules according to the split-horizon topology apply unchanged.
The learning bridge
app accepts a table as the value of the config key of its configuration. The following keys are defined:
— Key mac_table
Optional. This is a table that defines the characteristics of the MAC table. The following keys are defined:
— Key size
Optional. The number of MAC addresses to be stored in the table. Default is 256. The size of the table is increased automatically if this limit is reached or if an overflow in one of the hash buckets occurs. This value is capped by resize_max.
— Key timeout
Optional. Timeout for learned MAC addresses in seconds. Default is 60.
— Key verbose
Optional. A boolean value. If true, statistics about table usage are logged during each timeout interval. Default is false
.
— Key copy_on_resize
Optional. A boolean value. If true, the contents of the table are copied to the newly allocated table after a resize operation. Default is true
.
— Key resize_max
Optional. An upper bound for the size of the table. Default is 65536.
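As an illustration, a configuration table for a learning bridge might look as follows. This is a sketch under two assumptions that should be checked against the source: that size, timeout, verbose, copy_on_resize and resize_max are all keys of the mac_table table, and that the app is exported as `bridge` by the `apps.bridge.learning` module (shown only in a comment). The port and group names are made up.

```lua
local bridge_config = {
   ports = { "uplink" },                                -- free ports
   split_horizon_groups = { core = { "pe1", "pe2" } },
   config = {
      mac_table = {
         size = 1024,            -- initial number of MAC addresses
         timeout = 120,          -- seconds
         verbose = false,
         copy_on_resize = true,
         resize_max = 65536,     -- upper bound for automatic resizing
      },
   },
}

-- In a Snabb design the table would be passed to the app, e.g.
-- (assumed module and export name):
--   config.app(c, "bridge", require("apps.bridge.learning").bridge,
--              bridge_config)
```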
The IPFIX
app implements an RFC 7011 IPFIX “meter” and “exporter” that records the flows present in incoming traffic and sends exported UDP packets describing those flows to an external collector (not included). The exporter can produce output in either the standard RFC 7011 IPFIX format, or the older NetFlow v9 format from RFC 3954.
See the snabb ipfix probe
command-line interface for a program built using this app.
The IPFIX
app accepts a table as its configuration argument. The following keys are defined:
— Key idle_timeout
Optional. Number of seconds after which a flow should be considered idle and available for expiry. The default is 300 seconds.
— Key active_timeout
Optional. Period at which an active, non-idle flow should produce export records. The default is 120 seconds.
— Key cache_size
Optional. Initial size of flow tables, in terms of number of flows. The default is 20000.
— Key template_refresh_interval
Optional. Period at which to send template records over UDP. The default is 600 seconds.
— Key ipfix_version
Optional. Version of IPFIX to export. 9 indicates legacy NetFlow v9; 10 indicates RFC 7011 IPFIX. The default is 10.
— Key mtu
Optional. MTU for exported UDP packets. The default is 512.
— Key observation_domain
Optional. Observation domain tag to attach to all exported packets. The default is 256.
— Key exporter_ip
Required, sadly. The IPv4 address from which to send exported UDP packets.
— Key collector_ip
Required. The IPv4 address to which to send exported UDP packets.
— Key collector_port
Required. The port on which the collector is listening for UDP packets.
— Key templates
Optional. The templates for flows being collected. See the source code for more information.
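A configuration table for the IPFIX app could be sketched as below. The addresses are taken from the documentation range 192.0.2.0/24 and are placeholders; 4739 is the IANA-registered IPFIX port. The helper `missing_required` is hypothetical and merely checks for the three keys marked as required above.

```lua
local ipfix_config = {
   exporter_ip = "192.0.2.1",      -- required
   collector_ip = "192.0.2.2",     -- required
   collector_port = 4739,          -- required
   ipfix_version = 10,             -- 10: RFC 7011 IPFIX, 9: NetFlow v9
   idle_timeout = 300,             -- seconds
   active_timeout = 120,           -- seconds
   template_refresh_interval = 600,
   mtu = 512,
   observation_domain = 256,
}

-- Hypothetical helper: returns the names of missing required keys.
local required = { "exporter_ip", "collector_ip", "collector_port" }
local function missing_required (conf)
   local missing = {}
   for _, key in ipairs(required) do
      if conf[key] == nil then missing[#missing+1] = key end
   end
   return missing
end

print(#missing_required(ipfix_config))   -- 0
```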
Some ideas for things to hack on are below.
As it is, if an attacker can create millions of flows, then our flow set will expand to match (and never shrink). Perhaps we should cap the total size of the flow table.
For large ctables, we can only do 7 or 8 million lookups per second if we look up one key after another. However if we do lookups in parallel, then we can get 15 million or so, which would allow us to reach 10Gbps line rate on 64-byte packets.
We should try to model the configuration of the IPFIX app with a YANG schema. See RFC 6728 for some inspiration.
The links that we use as internal buffers between parts of the IPFIX app have some overhead as they have to update counters. Perhaps we should use a special-purpose data structure.
Currently internal flow start and end times use UNIX time. This isn’t great for timers, but it does match what’s specified in RFC 7011. Could we switch to monotonic time?
We can collect IPv6 flows of course, but we only export to collectors over IPv4 for the moment.
Right now, routing a packet towards a flow set means no other flow set can measure that packet. Perhaps this should change.
The Transport6
and Tunnel6
apps implement ESP in transport and tunnel mode, respectively. They encrypt packets received on their decapsulated
port and transmit them on their encapsulated
port, and vice-versa. Packets arriving on the decapsulated
port must have Ethernet and IPv6 headers, and packets arriving on the encapsulated
port must have Ethernet and IPv6 headers followed by an ESP header; otherwise they are discarded.
References:
lib.ipsec.esp
The Transport6
and Tunnel6
apps accept a table as their configuration argument. The following keys are defined:
— Key self_ip (Tunnel6
only)
Required. Source address of the encapsulating IPv6 header.
— Key nexthop_ip (Tunnel6
only)
Required. Destination address of the encapsulating IPv6 header.
— Key aead
Optional. The identifier of the AEAD to use for encryption and authentication. For now, only the default "aes-gcm-16-icv"
(AES-GCM with a 16 octet ICV) is supported.
— Key spi
Required. A 32 bit integer denoting the “Security Parameters Index” as specified in RFC 4303.
— Key transmit_key
Required. Hexadecimal string of 32 digits (two digits for each byte) that denotes a 128-bit AES key as specified in RFC 4106 used for the encryption of outgoing packets.
— Key transmit_salt
Required. Hexadecimal string of eight digits (two digits for each byte) that denotes four bytes of salt as specified in RFC 4106 used for the encryption of outgoing packets.
— Key receive_key
Required. Hexadecimal string of 32 digits (two digits for each byte) that denotes a 128-bit AES key as specified in RFC 4106 used for the decryption of incoming packets.
— Key receive_salt
Required. Hexadecimal string of eight digits (two digits for each byte) that denotes four bytes of salt as specified in RFC 4106 used for the decryption of incoming packets.
— Key receive_window
Optional. Minimum width of the window in which out of order packets are accepted as specified in RFC 4303. The default is 128.
— Key resync_threshold
Optional. Number of consecutive packets allowed to fail decapsulation before attempting “Re-synchronization” as specified in RFC 4303. The default is 1024.
— Key resync_attempts
Optional. Number of attempts to re-synchronize a packet that triggered “Re-synchronization” as specified in RFC 4303. The default is 8.
— Key auditing
Optional. A boolean value indicating whether to enable or disable “Auditing” as specified in RFC 4303. The default is nil
(no auditing).
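The key and salt formats described above can be checked with a small helper before a configuration is handed to the app. The following is a sketch: `is_hex` is a hypothetical helper, and the key material is a placeholder that must not be used in practice. For Tunnel6, the self_ip and nexthop_ip keys would be added to the same table.

```lua
-- Hypothetical helper: true if "s" is a string of exactly "digits"
-- hexadecimal digits.
local function is_hex (s, digits)
   return type(s) == "string" and #s == digits
      and s:match("^%x+$") ~= nil
end

-- Placeholder values for illustration only.
local esp_config = {
   spi = 0x100,
   transmit_key  = "00112233445566778899aabbccddeeff",   -- 128-bit key
   transmit_salt = "00112233",                           -- 4 bytes
   receive_key   = "ffeeddccbbaa99887766554433221100",
   receive_salt  = "33221100",
   receive_window = 128,
}

assert(is_hex(esp_config.transmit_key, 32), "transmit_key: need 32 hex digits")
assert(is_hex(esp_config.transmit_salt, 8), "transmit_salt: need 8 hex digits")
assert(is_hex(esp_config.receive_key, 32), "receive_key: need 32 hex digits")
assert(is_hex(esp_config.receive_salt, 8), "receive_salt: need 8 hex digits")
```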
The Match
app compares packets received on its input port rx
with those received on the reference input port comparator
, and reports mismatches as well as packets from comparator
that were not matched.
— Method Match:errors
Returns the recorded errors as an array of strings.
The Match
app accepts a table as its configuration argument. The following keys are defined:
— Key fuzzy
Optional. If this key is true
packets from rx
that do not match the next packet from comparator
are ignored. The default is false
.
— Key modest
Optional. If this key is true
unmatched packets from comparator
are ignored if at least one packet from rx was successfully matched. The default is false
.
The Synth
app generates synthetic packets with Ethernet headers and alternating payload sizes. On each breath it fills each attached output link with new packets.
The Synth
app accepts a table as its configuration argument. The following keys are defined:
— Key src
— Key dst
Source and destination MAC addresses in human readable form. The default is "00:00:00:00:00:00"
.
— Key sizes
An array of numbers designating the packet payload sizes. The default is {64}
.
— Key random_payload
Generate a random payload for each packet in sizes
.
— Key packet_id
Insert the packet number (32bit uint) directly after the ethertype. The packet number starts at 0 and is sequential on each output link.
The Npackets
app allows at most N packets to flow through it. Any further packets are never dequeued from input.
         +----------+
input -> | Npackets | -> output
         +----------+
The Npackets
app accepts a table as its configuration argument. The following keys are defined:
— Key npackets
The number of packets to forward. Further packets are never dequeued from input.
The L7Spy
app is a Snabb app that scans packets passing through it using an instance of the Scanner
class. The scanner instance may be shared among several L7Spy
instances or with a L7Fw
app for filtering.
— Method L7Spy:new config
Construct a new L7Spy
app instance based on a given configuration table. The table may contain the following key:
scanner
(optional): Either a string identifying the kind of scanner to construct (currently only "ndpi"
is accepted) or an existing scanner instance.
The L7Fw
app implements a stateful firewall by querying the scanner state collected by a L7Spy
app. It then filters packets based on a given set of rules.
— Method L7Fw:new config
Construct a new L7Fw
app instance based on a given configuration table. The table may contain the following keys:
scanner
: A Scanner
instance shared with an L7Spy
instance. The metadata in this scanner is used for packet filtering.
rules
: A table mapping protocol names (as strings) to firewall actions. The accepted actions are "accept"
, "reject"
, "drop"
, or a pfmatch expression. The pfmatch expression may use the variable flow_count
(as an arithmetic expression) to refer to the number of packets in a given protocol flow, and may call the accept
, reject
, or drop
methods.
local_ipv4
(optional): An IPv4 address that identifies the host running the firewall. This is used as the source address in ICMPv4 or TCP reject responses.
local_ipv6
(optional): An IPv6 address that identifies the host running the firewall. This is used as the source address in ICMPv6 or TCP reject responses.
local_macaddr
(optional): A MAC address that identifies the host running the firewall. This is used as the source address in Ethernet frames for reject responses.
logging
(optional): A log level parameter that can be set to “on” or “off”. When set to “on”, dropped and rejected packets are reported to the system log.
Scanner
objects are responsible for:
The class is not meant to be instantiated directly, but to be used as the basis for concrete implementations (e.g. NdpiScanner
). It provides one function that subclasses can use:
Extracts fields from the headers of an IPv4 or IPv6 packet. The returned values are:
Key objects contain some of the returned information in a compact FFI representation, and can be used as an aid to uniquely identify a flow of packets. They provide the following attributes:
:eth_type(): Method which returns the type of the Ethernet frame payload, either ETH_TYPE_IPv4 or ETH_TYPE_IPv6.
:hash(): Method which returns an integer calculated by hashing all the other values in the key object.
.vlan_id: VLAN identifier. Zero for no VLAN tags.
.ip_proto: The IP protocol.
.lo_addr and .hi_addr: IP addresses (either v4 or v6).
.lo_port and .hi_port: For TCP and UDP, the ports as big-endian (network) integers.
This method can be very useful to implement scanners using backends which do not implement their own flow classification.
All the Scanner
implementations conform to the Scanner
base API.
— Method Scanner:scan_packet packet, time
Scans a packet.
The time parameter is used to know at which time (in seconds from the Epoch) packet has been received for processing. A suitable value can be obtained using engine.now()
.
— Method Scanner:get_flow packet
Obtains the traffic flow for a given packet. If the packet is determined to not match any of the detected flows, nil
is returned. The returned flow object has at least the following fields:
protocol: The L7 protocol for the flow. A user-visible string can be obtained by passing this value to Scanner:protocol_name().
packets: Number of packets scanned which belong to the traffic flow.
last_seen: Last time (in seconds from the Epoch) at which a packet belonging to the flow has been scanned.
— Method Scanner:flows
Returns an iterator over all the traffic flows detected by the scanner. The returned value is suitable to be used in a for
-loop:
for flow in my_scanner:flows() do
   -- Do something with "flow".
end
— Method Scanner:protocol_name protocol
Given a protocol identifier, returns a user-friendly name as a string. Typically the protocol is obtained from flow objects returned by Scanner:get_flow()
.
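The Scanner methods above can be combined to summarize the traffic seen by a scanner. In the sketch below, `summarize_flows` relies only on the documented `flows` and `protocol_name` methods and the `protocol` and `packets` flow fields. The `stub_scanner` object, its protocol identifiers and its flow list are hypothetical stand-ins so that the example runs without a real scanner.

```lua
-- Returns a table mapping protocol names to the total number of
-- packets scanned in flows of that protocol.
local function summarize_flows (scanner)
   local packets_by_protocol = {}
   for flow in scanner:flows() do
      local name = scanner:protocol_name(flow.protocol)
      packets_by_protocol[name] =
         (packets_by_protocol[name] or 0) + flow.packets
   end
   return packets_by_protocol
end

-- Hypothetical stand-in for a real scanner, for demonstration only.
local stub_scanner = {
   names = { [7] = "HTTP", [5] = "DNS" },
   list = {
      { protocol = 7, packets = 12, last_seen = 100 },
      { protocol = 5, packets = 3,  last_seen = 101 },
      { protocol = 7, packets = 8,  last_seen = 102 },
   },
}
function stub_scanner:flows ()
   local i = 0
   return function ()
      i = i + 1
      return self.list[i]
   end
end
function stub_scanner:protocol_name (protocol)
   return self.names[protocol] or "unknown"
end

local summary = summarize_flows(stub_scanner)
print(summary.HTTP, summary.DNS)
```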
NdpiScanner
uses the nDPI library (via the ljndpi FFI binding) to scan packets and determine L7 traffic flows. The nDPI library (libndpi.so
) must be available in the host system. Versions 1.7 and 1.8 are supported.
— Method NdpiScanner:new ticks_per_second
Creates a new scanner, with a ticks_per_second resolution.
The apps.wall.util
module contains miscellaneous utilities.
— Function util.ipv4_addr_cmp a, b
Compares two IPv4 addresses a and b. The returned value follows the same convention as for C.memcmp()
: zero if both addresses are equal, or an integer value with the same sign as the sign of the difference between the first pair of bytes that differ in a and b.
— Function util.ipv6_addr_cmp a, b
Compares two IPv6 addresses a and b. The returned value follows the same convention as for C.memcmp()
: zero if both addresses are equal, or an integer value with the same sign as the sign of the difference between the first pair of bytes that differ in a and b.
The SouthAndNorth
application is not meant to be used directly, but rather as a building block for more complex applications which need two duplex ports (south
and north
) which forward packets between them, optionally doing some intermediate processing.
Packets arriving on the north
port are passed to the :on_southbound_packet()
method (which can be overridden in a subclass) and forwarded to the south
port. Conversely, packets arriving on the south
port are passed to the :on_northbound_packet()
method and finally forwarded to the north
port.
The value returned by :on_southbound_packet()
and :on_northbound_packet()
determines what will be done to the packet being processed:
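Returning false
discards the packet: the packet will not be forwarded, and packet.free()
will be called on it.
Returning a different packet replaces the original: the original packet has packet.free()
called on it, and the returned packet is forwarded.
Returning the same packet forwards it unchanged; returning nil
achieves the same effect.
The following snippet defines an application derived from SouthAndNorth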
which silently discards packets bigger than a certain size, and keeps a count of how many packets have been discarded and forwarded:
-- Setting SouthAndNorth as metatable "inherits" from it.
DiscardBigPackets = setmetatable({},
   require("apps.wall.util").SouthAndNorth)
function DiscardBigPackets:new (max_length)
   return setmetatable({
      max_packet_length = max_length,
      discarded_packets = 0,
      forwarded_packets = 0,
   }, self)
end
function DiscardBigPackets:on_northbound_packet (pkt)
   if pkt.length > self.max_packet_length then
      self.discarded_packets = self.discarded_packets + 1
      return false
   end
   self.forwarded_packets = self.forwarded_packets + 1
end
-- Apply the same logic for packets in the other direction.
DiscardBigPackets.on_southbound_packet =
   DiscardBigPackets.on_northbound_packet
The rss
app implements the basic functionality needed to provide generic receive side scaling to other apps. In essence, the rss
app takes packets from an arbitrary number n
of input links and distributes them to an arbitrary number m
of output links.
The distribution algorithm has the property that all packets belonging to the same flow are guaranteed to be mapped to the same output link, where a flow is identified by the value of certain fields of the packet header, depending on the type of packet.
For IPv4 and IPv6, the basic classifier is given by the 3-tuple (source address
, destination address
, protocol
), where protocol
is the value of the protocol field of the IPv4 header or the value of the next-header field that identifies the “upper-layer protocol” of the IPv6 header (which may be preceded by any number of extension headers).
If the protocol is either TCP (protocol #6), UDP (protocol #17) or SCTP (protocol #132), the list of header fields is augmented by the port numbers to yield the 5-tuple (source address
, destination address
, protocol
, source port
, destination port
).
The output link is determined by applying a hash function to the set of header fields:
out_link = ( hash(flow_fields) % m ) + 1
All other packets are not classified into flows and are always mapped to the first output link.
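The mapping from flows to output links can be illustrated as follows. The function `toy_hash` is a deliberately simple stand-in and is not the hash function used by the rss app; `out_link` applies the formula given above. The point of the sketch is only that identical flow fields always select the same link in the range 1 to m.

```lua
-- Toy hash for illustration only; the rss app uses its own hash function.
local function toy_hash (fields)
   local h = 0
   for _, field in ipairs(fields) do
      local s = tostring(field)
      for i = 1, #s do
         h = (h * 31 + s:byte(i)) % 4294967296
      end
   end
   return h
end

-- out_link = ( hash(flow_fields) % m ) + 1
local function out_link (flow_fields, m)
   return (toy_hash(flow_fields) % m) + 1
end

-- 5-tuple: source address, destination address, protocol,
-- source port, destination port.
local flow = { "192.0.2.1", "198.51.100.7", 6, 49152, 443 }
local m = 4
print(out_link(flow, m) == out_link(flow, m))   -- true
```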
The actual scaling property is achieved by running the receivers in separate processes and using specialized inter-process links to connect them to the rss
app.
In addition to this basic functionality, the rss
app also implements the following set of extensions.
The output links can be grouped into equivalence classes with respect to matching conditions in terms of arbitrary pflang expressions as provided by the pf
module. Matching packets are only distributed to the output links that belong to the equivalence class. By default, a single equivalence class exists which matches all packets. It is special in the sense that the matching condition cannot be expressed in pflang. This default class is the only one that can receive non-IP packets.
Classes are specified in an explicit order when an instance of the rss
app is created. The default class is created implicitly as the last element in the list. Each packet is matched against the filter expressions, starting with the first one. If a match is found, the packet is assigned to the corresponding equivalence class and processing of the list stops.
The default class can be disabled by configuration. In that case, packets not assigned to any class are dropped.
The standard flow-director assigns a packet to at most one class. Any class can also be marked with the attribute continue
to allow matches to multiple classes. When a packet is matched to such a class, it is distributed to the set of output links associated with that class but processing of the remaining filter expressions continues. If the packet matches a subsequent class, a copy is created and distributed to the corresponding set of output links. Processing stops when the packet matches a class that does not have the continue
attribute.
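The matching order and the effect of the continue attribute can be modeled with a short sketch. Here each class carries a plain Lua `match` function in place of a compiled pflang filter, and packets are simple tables; both are simplifications made for illustration, as is the function name `classify`. The implicit default class is modeled as the last element of the list.

```lua
-- Returns the names of the classes a packet is assigned to, in order.
-- An empty result means that the packet is dropped.
local function classify (classes, pkt, default_class)
   local matched = {}
   for _, class in ipairs(classes) do
      if class.match(pkt) then
         matched[#matched+1] = class.name
         -- Without "continue", processing of the list stops here.
         if not class.continue then return matched end
      end
   end
   -- The default class comes last and matches all packets.
   if default_class then matched[#matched+1] = "default" end
   return matched
end

local classes = {
   { name = "mirror", continue = true,
     match = function (pkt) return pkt.proto == "udp" end },
   { name = "dns",
     match = function (pkt) return pkt.proto == "udp" and pkt.port == 53 end },
}

print(table.concat(classify(classes, { proto = "udp", port = 53 }, true), ","))
-- mirror,dns
print(table.concat(classify(classes, { proto = "tcp", port = 80 }, true), ","))
-- default
```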
By default, all output links in a class are treated the same. In other words, if the input consists of a sufficiently large sample of random flows, all links will receive about the same share of them. It is possible to introduce a bias for certain links by assigning a weight to them, given by a positive integer w
. If the number of links is m
and the weight of link i
(1 <= i <= m
) is w_i
, the share of traffic received by it is given by
share_i = w_i/(w_1 + w_2 + ... + w_m)
For example, if m = 2
and w_1 = 1, w_2 = 2
, link #1 will get 1/3 and link #2 will get 2/3 of the traffic.
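The share formula is easy to evaluate for an arbitrary list of weights. The function name `shares` is introduced here for illustration only.

```lua
-- Computes share_i = w_i / (w_1 + w_2 + ... + w_m) for each link.
local function shares (weights)
   local total = 0
   for _, w in ipairs(weights) do total = total + w end
   local result = {}
   for i, w in ipairs(weights) do result[i] = w / total end
   return result
end

local s = shares({ 1, 2 })
print(s[1], s[2])   -- one third and two thirds of the traffic
```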
In order to compute the hash over the header fields, the rss
app must parse the packets to a certain extent. Internally, the result of this analysis is appended as a block of data to the end of the actual packet data. Because this data can be useful to other apps downstream of the rss
app, it is exposed as part of the API.
The meta-data is structured as follows
struct {
   uint16_t magic;
   uint16_t ethertype;
   uint16_t vlan;
   uint16_t total_length;
   uint8_t *filter_start;
   uint16_t filter_length;
   uint8_t *l3;
   uint8_t *l4;
   uint16_t filter_offset;
   uint16_t l3_offset;
   uint16_t l4_offset;
   uint8_t proto;
   uint8_t frag_offset;
   int16_t length_delta;
}
magic
This field contains the constant 0x5abb
to mark the start of a valid meta-data block. The get API function asserts that this value is correct.
ethertype
This is the Ethertype contained in the Ethernet header of the packet. If the frame is of type 802.1q, i.e. the Ethertype is 0x8100
, the ethertype
field is set to the effective Ethertype following the 802.1q header. Only one level of tagging is recognised, i.e. for double-tagged frames, ethertype
will contain the value 0x8100
.
vlan
If the frame contains an 802.1q tag, vlan
is set to the value of the VID
field of the 802.1q header. Otherwise it is set to 0.
total_length
If ethertype
identifies the frame as either an IPv4 or IPv6 packet (i.e. the values 0x0800
and 0x86dd
, respectively), total_length
is the size of the L3 payload of the Ethernet frame according to the L3 header, including the L3 header itself. For IPv4, this is the value of the header’s Total Length field. For IPv6, it is the sum of the header’s Payload Length field and the size of the basic header (40 bytes).
For all other values of ethertype
, total_length
is set to the effective size of the packet (according to the length
field of the packet
data structure) minus the size of the Ethernet header (14 bytes for untagged frames and 18 bytes for 802.1q tagged frames).
filter_start
This is a pointer into the packet that can be passed as first argument to a BPF matching function generated by pf.compile_filter.
For untagged frames, this is a pointer to the proper Ethernet header.
For 802.1q tagged frames, an offset of 4 bytes is added to skip the 802.1q header. The reason for this is that the pf
module does not implement the vlan
primitive of the standard BPF syntax. The additional 4-byte offset places the effective Ethertype (i.e. the same value as in the ethertype
meta-data field) at the position of an untagged Ethernet frame. Note that this makes the original MAC addresses unavailable to the filter.
filter_length
This value is the size of the chunk of data pointed to by filter_start
and can be passed as second argument to a BPF matching function generated by pf.compile_filter. It is equal to the size of the packet if the frame is untagged or 4 bytes less than that if the frame is 802.1q tagged.
l3
This is a pointer to the start of the L3 header in the packet.
l4
This is a pointer to the start of the L4 header in the packet. For IPv4 and IPv6, it points to the first byte following the L3 header. For all other packets, it is equal to l3
.
filter_offset
, l3_offset
, l4_offset
These values are the offsets of filter_start
, l3
, and l4
relative to the start of the packet. They are used by the copy API call to re-calculate the pointers after the meta-data block has been relocated.
proto
For IPv4 and IPv6, the proto
field contains the identifier of the upper layer protocol carried in the payload of the packet. For all other packets, its value is undefined.
For IPv4, the upper layer protocol is given by the value of the Protocol field of the header. For IPv6, it is the value of the Next Header field of the last extension header in the packet’s header chain. The rss
app recognizes the following protocol identifiers as extension headers according to the IANA ipv6-parameters registry
Note that the protocols 50 (Encapsulating Security Payload, ESP), 253 and 254 (reserved for experimentation and testing) are treated as upper layer protocols, even though, technically, they are classified as extension headers.
frag_offset
For fragmented IPv4 and IPv6 packets, the frag_offset
field contains the offset of the fragment in the original packet’s payload in 8-byte units. A value of zero indicates that the packet is either not fragmented at all or is the initial fragment.
For non-IP packets, the value is undefined.
length_delta
This field contains the difference of the packet’s effective length (as given by the length
field of the packet data structure) and the size of the packet calculated from the IP header, i.e. the sum of l3_offset
and total_length
. For a regular packet, this difference is zero.
A negative value indicates that the packet has been truncated. A typical scenario where this is expected to occur is a setup involving a port-mirror that truncates packets either due to explicit configuration or due to a hardware limitation. The length_delta
field can be used by a downstream app to determine whether it has received a complete packet.
A positive value indicates that the packet contains additional data which is not part of the protocol data unit. This is not expected to occur under normal circumstances. However, it has been observed that some devices perform this kind of padding when port-mirroring is configured with packet truncation and the mirrored packet is smaller than the truncation limit.
For non-IP packets, length_delta
is always zero.
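The definition of length_delta can be worked through with concrete numbers. The sketch below recomputes the value from plain numbers rather than reading it from a meta-data block; the function names `length_delta` and `describe` are hypothetical.

```lua
-- Effective packet length minus the length according to the IP header
-- (l3_offset + total_length).
local function length_delta (packet_length, l3_offset, total_length)
   return packet_length - (l3_offset + total_length)
end

local function describe (delta)
   if delta < 0 then return "truncated"
   elseif delta > 0 then return "padded"
   else return "complete" end
end

-- Untagged Ethernet frame (14 byte header) carrying an IPv4 packet
-- with a Total Length of 100 bytes.
print(describe(length_delta(114, 14, 100)))   -- complete
print(describe(length_delta(64, 14, 100)))    -- truncated
print(describe(length_delta(120, 14, 100)))   -- padded
```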
The pf
module does not implement the protochain
primitive for IPv6. The only extension header it can deal with is the fragmentation header (protocol 44). As a consequence, packets containing arbitrary extension headers cannot be matched against filter expressions.
To overcome this limitation, the meta-data generator of the rss
app removes all extension headers from a packet by default, leaving only the basic IPv6 header followed immediately by the upper layer protocol. The values of the Payload Length and Next Header fields of the basic IPv6 header as well as the packet length are adjusted accordingly.
Since the rss
app can accept packets from multiple sources, the information on which link the packet was received is not trivially available to receiving apps unless the packets contain a unique identifier of some sort, e.g. a particular VLAN tag. If such an identifier is not available, the rss
app can be configured to attach a pseudo VLAN tag to packets arriving on a particular input link. It is called “pseudo tagging” because the VLAN is only added to the packet’s meta-data, not the packet itself. As a consequence, a receiving app only sees this kind of tag when it examines the meta-data provided by the rss
app. Such a pseudo-tag also overrides any native VLAN tag that a packet might have.
The pseudo-tagging is enabled by following a convention for the naming of input links as described below.
If proper VLAN tagging is required, the vlan.vlan.Tagger
app can be pushed between the packet source and the input link.
The rss
app accepts the following arguments.
— Key default_class
Optional. A boolean that specifies whether the default filter class should be enabled. The default is true
. The name of the default class is default.
— Key classes
Optional. An ordered list of class specifications. Each specification must be a table with the following keys.
Key name
Required. The name of the class. It must be unique among all classes and it must match the Lua regular expression %w+
.
Key filter
Required. A string containing a pflang filter expression.
Key continue
Optional. A boolean that specifies whether processing of classes should continue if a packet has matched the filter of this class. The default is false
.
— Key remove_extension_headers
Optional. A boolean that specifies whether IPv6 extension headers should be removed from packets. The default is true
.
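A configuration table using the keys above might look like the following sketch. The class names and filter expressions are examples, and `check_classes` is a hypothetical helper that enforces the documented constraints on class specifications (a unique name matching %w+ and a filter string).

```lua
local rss_config = {
   default_class = true,
   remove_extension_headers = true,
   classes = {
      { name = "dns", filter = "udp port 53" },
      { name = "web", filter = "tcp port 80 or tcp port 443",
        continue = true },
   },
}

-- Hypothetical helper: raises an error for invalid class specifications.
local function check_classes (classes)
   local seen = {}
   for _, class in ipairs(classes) do
      assert(type(class.name) == "string" and class.name:match("^%w+$"),
             "class name must match %w+")
      assert(not seen[class.name],
             "duplicate class name: "..tostring(class.name))
      assert(type(class.filter) == "string",
             "filter must be a pflang string")
      seen[class.name] = true
   end
   return true
end

print(check_classes(rss_config.classes))   -- true
```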
The classes configuration option specifies the set of classes known to an instance of the rss
app. The assignment of links to classes is done implicitly by connecting other apps using the convention <class>_<instance>
for the name of the links, where <class>
is the name of the class to which the links should be assigned exactly as specified by the name parameter of the class definition. The <instance>
specifier can be any string (adhering to the naming convention for links) that distinguishes the links within a class.
If the instance specifier is formatted as <instance>_<weight>
, where <instance>
is restricted to the pattern %w+
and <weight>
must be a number, the link’s weight is set to the value <weight>
. The default weight for a link is 1.
If the rss
app detects an output link whose name does not match any of the configured classes, it issues a warning message and ignores the link. Classes to which no output links are assigned are ignored.
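The naming convention for output links can be made concrete with Lua patterns. The function `parse_link_name` below is a hypothetical sketch of the convention as described, not the parser used by the rss app. It assumes that the class name ends at the first underscore, which follows from class names being restricted to %w+.

```lua
-- Splits an output link name of the form <class>_<instance> and extracts
-- the weight if the instance specifier has the form <instance>_<weight>.
-- "class_names" is the set of configured class names. Returns nil for
-- links that do not belong to a configured class.
local function parse_link_name (name, class_names)
   local class, instance = name:match("^(%w+)_(.+)$")
   if not class or not class_names[class] then
      return nil   -- the rss app warns about and ignores such links
   end
   local weight = 1   -- documented default
   local w = instance:match("^%w+_(%d+)$")
   if w then weight = tonumber(w) end
   return class, instance, weight
end

local class_names = { dns = true, default = true }
print(parse_link_name("dns_worker1", class_names))     -- dns  worker1    1
print(parse_link_name("dns_worker2_3", class_names))   -- dns  worker2_3  3
print(parse_link_name("web_worker1", class_names))     -- nil
```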
The names of the input links are arbitrary unless the VLAN pseudo-tagging feature should be used. In that case, the link must be named vlan<vlan-id>
, where <vlan-id>
must be a number between 1 and 4094 and will be placed in the <vlan>
meta-data field of every packet received on the link (irrespective of whether the packet has a real VLAN ID or not).
The meta-data functionality is provided by the module apps.rss.metadata
and provides the following API.
— Function add packet, remove_extension_headers, vlan
Analyzes packet and adds a meta-data block starting immediately after the packet data. If the boolean remove_extension_headers is true
, IPv6 extension headers are removed from the packet. The optional vlan overrides the value of the vlan
meta-data field extracted from the packet, irrespective of whether the packet actually has a tag or not.
An error is raised if there is not enough room for the meta-data block in the packet.
— Function get packet
Returns a pointer to the meta-data in packet. An error is raised if the meta-data block does not start with the magic number (0x5abb
).
— Function copy packet
Creates a copy of packet including the meta-data block. Returns a pointer to the new packet.
The “interlink” transmitter and receiver apps allow for efficient exchange of packets between Snabb processes within the same process group (see Multiprocess operation (core.worker)).
To make packets from an output port available to other processes, configure a transmitter app, and link the appropriate output port to its input
port.
local Transmitter = require("apps.interlink.transmitter")
config.app(c, "interlink", Transmitter)
config.link(c, "myapp.output -> interlink.input")
Then, in the process that should receive the packets, configure a receiver app with the same name, and link its output
port as suitable.
local Receiver = require("apps.interlink.receiver")
config.app(c, "interlink", Receiver)
config.link(c, "interlink.output -> otherapp.input")
Subsequently, packets transmitted to the transmitter’s input
port will appear on the receiver’s output
port.
Alternatively, a name can be supplied as a configuration argument to be used instead of the app’s name:
config.app(c, "mylink", Receiver, "interlink")
config.link(c, "mylink.output -> otherapp.input")
The configured app names denote globally unique queues within the process group. Alternatively, the receiver and transmitter apps can be passed a string that names the shared queue to attach to.
Starting either the transmitter or receiver app attaches it to a shared packet queue visible to the process group under the name that was given to the app. When the queue identified by the name is unavailable, because it is already in use by a pair of processes within the group, configuration of the app network will block until the queue becomes available. Once the transmitter or receiver apps are stopped they detach from the queue.
Only two processes (one receiver and one transmitter) can be attached to an interlink queue at the same time, but during the lifetime of the queue (i.e., from when the first process attaches to when the last process detaches) it can be shared by any number of receivers and transmitters. This means that either process attached to the queue can be restarted or replaced by another process without packet loss.
The checksum module provides an optimized ones-complement checksum routine.
— Function ipsum pointer length initial
Return the ones-complement checksum for the given region of memory.
pointer is a pointer to an array of data to be checksummed. initial is an unsigned 16-bit number in host byte order which is used as the starting value of the accumulator. The result is the IP checksum over the data in host byte order.
The initial argument can be used to verify a checksum or to calculate the checksum incrementally over chunks of memory. To check whether the checksum over a block of data is equal to a given value, pass that value as initial and compare the result to zero:
if ipsum(pointer, length, value) == 0 then
-- checksum correct
else
-- checksum incorrect
end
To chain the calculation of checksums over multiple blocks of data together to obtain the overall checksum, one needs to pass the one’s complement of the checksum of one block as initial value to the call of ipsum() for the following block, e.g.
local sum1 = ipsum(data1, length1, 0)
local total_sum = ipsum(data2, length2, bit.bnot(sum1))
This function takes advantage of SIMD hardware when available.
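The verification and chaining identities above can be checked with a small pure-Lua model of the checksum. This is only an illustrative sketch, not the module's optimized routine: the name `ipsum_ref` is hypothetical, it takes a Lua string instead of a pointer and length, it assumes 16-bit words are read in network (big-endian) order, and chaining assumes every block except the last has an even length.

```lua
-- Reference model of a ones-complement checksum (hypothetical helper,
-- not part of Snabb). `data` is a Lua string, `initial` seeds the accumulator.
local function ipsum_ref(data, initial)
   local sum = initial or 0
   local n = #data
   -- Add up the data as big-endian 16-bit words.
   for i = 1, n - 1, 2 do
      sum = sum + data:byte(i) * 256 + data:byte(i + 1)
   end
   -- A trailing odd byte is padded with a zero byte.
   if n % 2 == 1 then
      sum = sum + data:byte(n) * 256
   end
   -- Fold the carries back into the low 16 bits.
   while sum > 0xffff do
      sum = (sum % 0x10000) + math.floor(sum / 0x10000)
   end
   -- The checksum is the ones-complement of the folded sum.
   return 0xffff - sum
end

local block1, block2 = "\1\2\3\4", "\5\6\7"
local whole = ipsum_ref(block1 .. block2, 0)

-- Verification: seeding with the expected checksum yields zero.
local verified = ipsum_ref(block1 .. block2, whole) == 0

-- Chaining: seed the second block with the complement of the first result.
local sum1 = ipsum_ref(block1, 0)
local chained = ipsum_ref(block2, 0xffff - sum1)
```

Here `0xffff - sum1` plays the role of `bit.bnot(sum1)` restricted to 16 bits.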
A ctable is a hash table whose keys and values are instances of FFI data types. In Lua parlance, an FFI value is a “cdata” value, hence the name “ctable”.
A ctable is parameterized for the specific types for its keys and values. This allows for the table to be stored in an efficient manner. Adding an entry to a ctable will copy the value into the table. Logically, the table “owns” the value. Lookup can either return a pointer to the value in the table, or copy the value into a user-supplied buffer, depending on what is most convenient for the user.
To create a ctable, first create a parameters table specifying the key and value types, along with any other options. Then call ctable.new
on those parameters. For example:
local ctable = require('lib.ctable')
local ffi = require('ffi')
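-- Expected number of entries; an illustrative value, choose one for your workload.
local occupancy = 2e6
local params = {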
key_type = ffi.typeof('uint32_t'),
value_type = ffi.typeof('int32_t[6]'),
max_occupancy_rate = 0.4,
initial_size = math.ceil(occupancy / 0.4)
}
local ctab = ctable.new(params)
— Function ctable.new parameters
Create a new ctable. parameters is a table of key/value pairs. The following keys are required:
key_type
: An FFI type (LuaJIT “ctype”) for keys in this table.
value_type
: An FFI type (LuaJIT “ctype”) for values in this table.
Optional entries that may be present in the parameters table include:
hash_seed
: A hash seed, as a 16-byte array. The hash value of a key is a function of the key and also of the hash seed. Using a hash function with a seed prevents some kinds of denial-of-service attacks against network functions that use ctables. The seed defaults to a fresh random byte string. The seed also changes whenever a table is resized.
initial_size
: The initial size of the hash table, including free space. Defaults to 8 slots.
max_occupancy_rate
: The maximum ratio of occupancy/size, where occupancy denotes the number of entries in the table, and size is the total table size including free entries. Trying to add an entry to a “full” table will cause the table to grow in size.
min_occupancy_rate
: Minimum ratio of occupancy/size. Removing an entry from an “empty” table will shrink the table.
resize_callback
: An optional function that is called after the table has been resized. The function is called with two arguments: the ctable object and the old size. By default, no callback is used.
— Function ctable.load stream parameters
Load a ctable that was previously saved out to a binary format. parameters are as for ctable.new
. stream should be an object that has a :read_ptr(ctype) method, which returns a pointer to an embedded instance of ctype in the stream, advancing the stream over the object; and a :read_array(ctype, count) method, which does the same but reads count instances of ctype instead of just one.
Users interact with a ctable through methods. In these method descriptions, the object on the left-hand-side of the method invocation should be a ctable.
— Method :resize size
Resize the ctable to have size total entries, including empty space.
— Method :add key, value, updates_allowed
Add an entry to the ctable. key and value are FFI values for the key and the value, respectively.
updates_allowed is an optional parameter. If not present or false, then the :add
method will raise an error if the key is already present in the table. If updates_allowed is the string "required"
, then an error will be raised if key is not already in the table. Any other true value allows updates but does not require them. An update will replace the existing entry in the table.
Returns a pointer to the inserted entry. Any subsequent modification to the table may invalidate this pointer.
— Method :update key, value
Update the entry in a ctable with the key key to have the new value value. Throw an error if key is not present in the table.
— Method :lookup_ptr key
Look up key in the table, and if found return a pointer to the entry. Return nil if the key is not found.
An entry pointer has three fields: the hash
value, which must not be modified; the key
itself; and the value
. Access them as usual in Lua:
local ptr = ctab:lookup_ptr(key)
if ptr then print(ptr.value) end
Note that pointers are only valid until the next modification of a table.
— Method :lookup_and_copy key, entry
Look up key in the table, and if found, copy that entry into entry and return true. Otherwise return false.
— Method :remove_ptr entry
Remove an entry from a ctable. entry should be a pointer that points into the table. Note that pointers are only valid until the next modification of a table.
— Method :remove key, missing_allowed
Remove an entry from a ctable, keyed by key.
Return true if we actually do find a value and remove it. Otherwise if no entry is found in the table and missing_allowed is true, then return false. Otherwise raise an error.
— Method :save stream
Save a ctable to a byte sink. stream should be an object that has a :write_ptr(ctype) method, which writes an instance of a struct type out to a stream, and :write_array(ctype, count) which is the same but writing count instances of ctype instead of just one.
— Method :selfcheck
Run an expensive internal diagnostic to verify that the table’s internal invariants are fulfilled.
— Method :dump
Print out the entries in a table. Can be expensive if the table is large.
— Method :iterate
Return an iterator for use by for in
. For example:
for entry in ctab:iterate() do
print(entry.key, entry.value)
end
As an implementation detail, the table is stored as an open-addressed robin-hood hash table with linear probing. Ctables use the high-quality SipHash hash function to allow for good distribution of hash values. To find a value associated with a key, a ctable will first hash the key, map that hash value to an index into the table by scaling the hash to the table size, and then scan forward in the table until we find an entry whose hash value is greater than or equal to the hash in question. Each entry stores its hash value, and empty entries have a hash of 0xFFFFFFFF
. If the entry’s hash matches and the entry’s key is equal to the one we are looking for, then we have our match. If the entry’s hash is greater than our hash, then we have a failure. Hash collisions are possible as well of course; in that case we continue scanning forward.
The distance travelled while scanning for the matching hash is known as the displacement. The table measures its maximum displacement, for a number of purposes, but you might be interested to know that a maximum displacement for a table with 2 million entries and a 40% load factor is around 8 or 9. Smaller tables will have smaller maximum displacements.
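The lookup scheme described above can be modelled in a few lines of plain Lua. This is a toy sketch under stated assumptions, not the ctable implementation: all names are hypothetical, a simple multiplicative hash stands in for SipHash, keys are non-negative integers, entries are never removed, and the table never resizes.

```lua
-- Toy model of a hash-ordered, linear-probing table (hypothetical names).
local EMPTY = 0xFFFFFFFF     -- hash value marking an empty slot
local TWO_32 = 4294967296    -- 2^32

-- Stand-in for SipHash: a multiplicative hash on integer keys.
local function toy_hash(key)
   local h = (key * 2654435761) % TWO_32
   if h == EMPTY then h = EMPTY - 1 end  -- EMPTY is reserved
   return h
end

local function new_table(size)
   -- Allocate slack past the last primary slot so scans never wrap around.
   local t = { size = size, capacity = size * 2, entries = {} }
   for i = 0, t.capacity - 1 do t.entries[i] = { hash = EMPTY } end
   return t
end

-- Scale a 32-bit hash to a primary index in [0, size).
local function primary_index(t, hash)
   return math.floor(hash * t.size / TWO_32)
end

local function insert(t, key, value)
   local hash = toy_hash(key)
   local i = primary_index(t, hash)
   -- Scan forward past entries with smaller hashes.
   while t.entries[i].hash < hash do i = i + 1 end
   -- Find the next empty slot and shift the run forward to keep hash order.
   local j = i
   while t.entries[j].hash ~= EMPTY do j = j + 1 end
   assert(j < t.capacity - 1, "toy table is full")
   for k = j, i + 1, -1 do t.entries[k] = t.entries[k - 1] end
   t.entries[i] = { hash = hash, key = key, value = value }
end

-- Returns the value and the displacement, or nil if the key is absent.
local function lookup(t, key)
   local hash = toy_hash(key)
   local start = primary_index(t, hash)
   local i = start
   while t.entries[i].hash < hash do i = i + 1 end
   while t.entries[i].hash == hash do   -- equal hashes: compare keys
      if t.entries[i].key == key then
         return t.entries[i].value, i - start
      end
      i = i + 1
   end
   return nil   -- reached a larger hash (or an empty slot): not present
end

local t = new_table(16)
for key = 1, 8 do insert(t, key, key * 10) end
```

Because entries are kept in hash order, a failed lookup can stop at the first entry whose hash is greater than the one sought instead of scanning to the next empty slot.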
The ctable has two lookup interfaces. The first consists of the lookup
methods described above. The other interface will fetch all entries within the maximum displacement into a buffer, then do a branchless binary search over that buffer. This second streaming lookup can also fetch entries for multiple keys in one go. This can amortize the cost of a round-trip to RAM, in the case where you expect to miss cache for every lookup.
To perform a streaming lookup, first prepare a LookupStreamer
for the batch size that you need. You will have to experiment to find the batch size that works best for your table’s entry sizes; for reference, for 32-byte entries a 32-wide lookup seems to be optimum.
-- Stream in 32 lookups at once.
local stride = 32
local streamer = ctab:make_lookup_streamer(stride)
Wiring up streaming lookup in a packet-processing network is a bit of a chore currently, as you have to maintain separate queues of lookup keys and packets, assuming that each lookup maps to a packet. Let’s make a little helper:
local lookups = {
queue = ffi.new("struct packet * [?]", stride),
queue_len = 0,
streamer = streamer
}
local function flush(lookups)
if lookups.queue_len > 0 then
-- Here is the magic!
lookups.streamer:stream()
for i = 0, lookups.queue_len - 1 do
local pkt = lookups.queue[i]
if lookups.streamer:is_found(i) then
local val = lookups.streamer.entries[i].value
--- Do something cool here!
end
end
lookups.queue_len = 0
end
end
local function enqueue(lookups, pkt, key)
local n = lookups.queue_len
lookups.streamer.entries[n].key = key
lookups.queue[n] = pkt
n = n + 1
if n == stride then
flush(lookups)
else
lookups.queue_len = n
end
end
Then as you see packets, you enqueue them via enqueue
, extracting out the key from the packet in some way and passing that value as the argument. When enqueue
detects that the queue is full, it will flush it, performing the lookups in parallel and processing the results.
An implementation of Poptrie. Includes high-level functions for building the Poptrie data structure, as well as a hand-written, optimized assembler lookup routine.
local poptrie = require("lib.poptrie")
local ipv4 = require("lib.protocol.ipv4")
local pt = poptrie.new{direct_pointing=true}
-- Associate prefixes of a given length with values (uint16_t)
pt:add(ipv4:pton("192.168.0.0"), 16, 1)
pt:add(ipv4:pton("192.0.0.0"), 8, 2)
pt:build()
pt:lookup32(ipv4:pton("192.1.2.3")) -- ⇒ 2
pt:lookup32(ipv4:pton("192.168.2.3")) -- ⇒ 1
-- The value zero denotes "no match"
pt:lookup32(ipv4:pton("193.1.2.3")) -- ⇒ 0
-- You can create a pre-built poptrie from its backing memory.
local pt2 = poptrie.new{
nodes = pt.nodes,
leaves = pt.leaves,
directmap = pt.directmap
}
Note that performance tends to be memory-bound. The results below reflect ideal conditions with hot caches. See Benchmarking Poptrie.
PMU analysis (numentries=10000, numhit=100, keysize=32)
build: 0.1857 seconds
lookup: 8460.17 cycles/lookup 18089.70 instructions/lookup
lookup32: 62.71 cycles/lookup 99.99 instructions/lookup
lookup64: 64.11 cycles/lookup 100.00 instructions/lookup
lookup128: 74.44 cycles/lookup 118.66 instructions/lookup
build(direct_pointing): 0.1676 seconds
lookup(direct_pointing): 1306.68 cycles/lookup 3146.96 instructions/lookup
lookup32(direct_pointing): 35.49 cycles/lookup 62.61 instructions/lookup
lookup64(direct_pointing): 35.95 cycles/lookup 62.61 instructions/lookup
lookup128(direct_pointing): 37.75 cycles/lookup 66.81 instructions/lookup
— Function new init
Creates and returns a new Poptrie
object.
Init is a table with the following keys:
direct_pointing - Optional. Boolean that governs whether to use the direct pointing optimization. Default is false.
s - Optional. Bits to use for the direct pointing optimization. Default is 18. Note that the direct map array will be 2×2ˢ bytes in size.
leaves - Optional. An array of leaves. When leaves is supplied nodes must be supplied as well.
nodes - Optional. An array of nodes. When nodes is supplied leaves must be supplied as well.
directmap - Optional. A direct map array. When directmap is supplied, nodes and leaves must be supplied as well and direct_pointing is implicit.
— Method Poptrie:add prefix length value
Associates value to prefix of length. Prefix must be a uint8_t *
pointing to at least math.ceil(length/8)
bytes. Length must be an integer equal to or greater than 1. Value must be a 16‑bit unsigned integer, and should be greater than zero (see lookup*
as to why.)
— Method Poptrie:build
Compiles the optimized poptrie data structure used by lookup64
. After calling this method, the leaves and nodes fields of the Poptrie
object will contain the leaves and nodes arrays respectively. These arrays can be used to construct a Poptrie
object.
— Method Poptrie:lookup32 key
— Method Poptrie:lookup64 key
— Method Poptrie:lookup128 key
Looks up key in the Poptrie
object and returns the associated value or zero. Key must be a uint8_t *
pointing to at least 4/8/16 bytes respectively.
Unless the Poptrie
object was initialized with leaves and nodes arrays, the user must call Poptrie:build
before calling Poptrie:lookup64
.
It is an error to call these lookup routines on poptries that contain prefixes longer than supported by the individual lookup routine. I.e., you can only call lookup64
on poptries with prefixes of less than or equal to 64 bits.
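The semantics of add and the lookup routines (longest matching prefix wins, zero means “no match”) can be illustrated with a naive linear scan. This is a sketch for checking expectations only, not the Poptrie algorithm: all names are hypothetical, addresses are plain Lua arrays of bytes, and bits are assumed to be counted from the most significant bit of the first byte.

```lua
-- Naive longest-prefix match (hypothetical helpers, not part of Snabb).
local function parse_ipv4(s)
   local a, b, c, d = s:match("^(%d+)%.(%d+)%.(%d+)%.(%d+)$")
   return { tonumber(a), tonumber(b), tonumber(c), tonumber(d) }
end

-- Return bit number i of a byte array, counting from the most
-- significant bit of the first byte (an assumption of this sketch).
local function bit_at(bytes, i)
   local byte = bytes[math.floor(i / 8) + 1]
   return math.floor(byte / 2 ^ (7 - i % 8)) % 2
end

-- True if the first `length` bits of prefix and key agree.
local function matches(prefix, length, key)
   for i = 0, length - 1 do
      if bit_at(prefix, i) ~= bit_at(key, i) then return false end
   end
   return true
end

local function lpm_add(rib, prefix, length, value)
   rib[#rib + 1] = { prefix = prefix, length = length, value = value }
end

-- Return the value of the longest matching prefix, or 0 for "no match".
local function lpm_lookup(rib, key)
   local best_length, best_value = 0, 0
   for _, entry in ipairs(rib) do
      if entry.length > best_length
         and matches(entry.prefix, entry.length, key) then
         best_length, best_value = entry.length, entry.value
      end
   end
   return best_value
end

local rib = {}
lpm_add(rib, parse_ipv4("192.168.0.0"), 16, 1)
lpm_add(rib, parse_ipv4("192.0.0.0"), 8, 2)
```

Since zero is the “no match” result, a stored value of zero would be indistinguishable from a miss, which is why values should be greater than zero.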
The CPU’s PMU (Performance Monitoring Unit) collects information about specific performance events such as cache misses, branch mispredictions, and utilization of internal CPU resources like execution units. This module provides an API for counting events with the PMU.
Hundreds of low-level counters are available. The exact list depends on CPU model. See pmu_cpu.lua for our definitions.
— Function is_available
If the PMU hardware is available then return true. Otherwise return two values: false and a string briefly explaining why. (Cooperation from the Linux kernel is required to access the PMU.)
— Function profile function [event_list] [aux]
Call function, return the result, and print a human-readable report of the performance events that were counted during execution.
— Function measure function [event_list]
Call function and return two values: the result and a table of performance event counter tallies.
— Function setup event_list
Set up the hardware performance counters to track a given list of events (in addition to the built-in fixed-function counters).
Each event is a Lua string pattern. This could be a full event name:
mem_load_uops_retired.l1_hit
or a more general pattern that matches several counters:
mem_load.*l._hit
Return the number of overflowed counters that could not be tracked due to hardware constraints. These will be the last counters in the list.
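How a Lua string pattern selects event names can be tried out in isolation. The sketch below is an assumption-laden illustration, not the module's code: `select_events` is a hypothetical helper, the event list is a small illustrative subset, and patterns are matched unanchored with `string.find`, which may differ from how setup matches them.

```lua
-- Hypothetical helper: select the event names matched by a list of Lua
-- string patterns, preserving the order of the available events.
local function select_events(available, patterns)
   local selected = {}
   for _, name in ipairs(available) do
      for _, pattern in ipairs(patterns) do
         if name:find(pattern) then
            selected[#selected + 1] = name
            break  -- avoid adding the same event twice
         end
      end
   end
   return selected
end

-- Illustrative subset of event names; the real list depends on the CPU model.
local available = {
   "mem_load_uops_retired.l1_hit",
   "mem_load_uops_retired.l2_hit",
   "mem_load_uops_retired.l1_miss",
   "uops_issued.any",
}

local hits = select_events(available, { "mem_load.*l._hit" })
local exact = select_events(available, { "uops_issued.any" })
```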
Example:
setup({"uops_issued.any",
"uops_retired.all",
"br_inst_retired.conditional",
"br_misp_retired.all_branches"}) => 0
— Function new_counter_set
Return a counter_set
object that can be used for accumulating events. The counter_set will be valid only until the next call to setup().
— Function switch_to counter_set
Switch to a new set of counters to accumulate events in. Has the side-effect of committing the current accumulators to the previous record.
If counter_set is nil then do not accumulate events.
— Function to_table counter_set
Return a table containing the values accumulated in counter_set.
Example:
to_table(cs) =>
{
-- Fixed-function counters
instructions = 133973703,
cycles = 663011188,
ref-cycles = 664029720,
-- General purpose counters selected with setup()
uops_issued.any = 106860997,
uops_retired.all = 106844204,
br_inst_retired.conditional = 26702830,
br_misp_retired.all_branches = 419
}
— Function report counter_set [aux]
Print a textual report on the values accumulated in a counter set. Optionally include auxiliary application-level counters. The ratio of each event to each auxiliary counter is also reported.
Example:
report(my_counter_set, {packet = 26700000, breath = 208593})
prints output approximately like:
EVENT TOTAL /packet /breath
instructions 133,973,703 5.000 642.000
cycles 663,011,188 24.000 3178.000
ref-cycles 664,029,720 24.000 3183.000
uops_issued.any 106,860,997 4.000 512.000
uops_retired.all 106,844,204 4.000 512.000
br_inst_retired.conditional 26,702,830 1.000 128.000
br_misp_retired.all_branches 419 0.000 0.000
packet 26,700,000 1.000 128.000
breath 208,593 0.008 1.000
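The per-auxiliary-counter columns are simply each event total divided by each auxiliary count. A minimal sketch of that arithmetic follows, assuming a hypothetical helper named `ratios` and using two of the totals from the example; it is not the module's report code.

```lua
-- Hypothetical helper: divide every event total by every auxiliary counter.
local function ratios(events, aux)
   local result = {}
   for event, total in pairs(events) do
      result[event] = {}
      for name, count in pairs(aux) do
         result[event][name] = total / count
      end
   end
   return result
end

local r = ratios(
   { instructions = 133973703, cycles = 663011188 },
   { packet = 26700000, breath = 208593 })

-- Print one line per event, e.g. instructions per packet and per breath.
for _, event in ipairs({ "instructions", "cycles" }) do
   print(string.format("%-14s %10.3f /packet %12.3f /breath",
                       event, r[event].packet, r[event].breath))
end
```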
YANG is a data modelling language designed for use in networking equipment, standardized as RFC 6020. The lib.yang
modules provide YANG facilities to Snabb applications, allowing operators to understand how to work with a Snabb data plane and also providing convenient configuration facilities for data-plane authors.
Everything in YANG starts with a schema: a specification of the data model of a device. For example, consider a simple Snabb router that receives IPv4 traffic and sends it out one of 12 ports. We might model it like this:
module snabb-simple-router {
namespace snabb:simple-router;
prefix simple-router;
import ietf-inet-types {prefix inet;}
leaf active { type boolean; default true; }
container routes {
list route {
key addr;
leaf addr { type inet:ipv4-address; mandatory true; }
leaf port { type uint8 { range 0..11; } mandatory true; }
}
}
}
Given this schema, lib.yang
can automatically derive a configuration file format for this Snabb program and create a parser that applies the validation constraints from the schema. The result is a simple plain-old-data Lua object that the data-plane can use directly.
Additionally there is support for efficient binary compilation of configurations. The problem is that even in this simple router, the routing table can grow quite large. While particular applications can sometimes incrementally update their configurations without completely reloading the configuration from the start, in general reloading is almost always a possibility, and you want to avoid packet loss during the time that the millions of routing table entries are loaded and validated.
For that reason the lib.yang
code also defines a mapping that, given a YANG schema, can compile any configuration for that schema into a pre-validated binary file that the data-plane can just load up directly. Additionally for list
nodes that map between keys and values, the lib.yang
facilities can compile that map into an efficient ctable
, letting the data-plane use the configuration as-is.
The schema given above can be loaded from a string using load_schema
from the lib.yang.schema
module, from a file via load_schema_file
, or by name using load_schema_by_name
. This last interface allows one to compile a YANG schema into the Snabb binary directly; if we name the above file snabb-simple-router.yang
and place it in the src/lib/yang
directory, then load_schema_by_name('snabb-simple-router')
will find it appropriately. Indeed, this is how the ietf-inet-types
import in the above example was resolved.
Consider again the example snabb-simple-router
schema. To configure a router, we need to provide a configuration in a way that the application can understand. In Snabb, we derive this configuration syntax from the schema, in the following way:
A module
’s configuration is composed of the configurations of all data nodes (container
, leaf-list
, list
, and leaf
) inside it.
A leaf
’s configuration is like keyword value;
, where the keyword is the name of the leaf, and the value is in the right syntax for the leaf’s type. (More on value types below.)
A container
’s configuration is the container’s keyword followed by the configuration of its data node children, like keyword { configuration... }
.
A leaf-list
’s configuration is a sequence of 0 or more instances of keyword value;
, as in leaf
.
A list
’s configuration is a sequence of 0 or more instances of the form keyword { configuration... }
, again where keyword
is the list name and configuration...
indicates the configuration of child data nodes.
Concretely, for the example schema above, the above algorithm derives a configuration format of the following form:
(active true|false;)?
(routes {
(route { addr ipv4-address; port uint8; })*
})?
In this grammar syntax, (foo)?
indicates either 0 or 1 instances of foo
, (foo)*
is similar but indicates 0 or more instances, and |
expresses alternation.
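The derivation rules can be made concrete with a small serializer that walks a schema and emits configuration text. This is a sketch only: the schema representation (tables with `kind`, `name` and `body` fields) and the function name `serialize` are hypothetical and unrelated to the data structures used by lib.yang, and values are printed with `tostring` without any type-specific syntax or validation.

```lua
-- Hypothetical, simplified schema: each node has a `kind`, a `name`, and
-- (for container and list nodes) a `body` array of child nodes.
local schema = {
   { kind = "leaf", name = "active" },
   { kind = "container", name = "routes", body = {
      { kind = "list", name = "route", body = {
         { kind = "leaf", name = "addr" },
         { kind = "leaf", name = "port" },
      } },
   } },
}

-- Append the configuration text for `data` to the array `out`.
local function serialize(nodes, data, out)
   for _, node in ipairs(nodes) do
      local value = data[node.name]
      if value == nil then
         -- Absent node: emit nothing.
      elseif node.kind == "leaf" then
         out[#out + 1] = node.name .. " " .. tostring(value) .. ";"
      elseif node.kind == "leaf-list" then
         -- Zero or more instances of "keyword value;".
         for _, v in ipairs(value) do
            out[#out + 1] = node.name .. " " .. tostring(v) .. ";"
         end
      elseif node.kind == "container" then
         out[#out + 1] = node.name .. " {"
         serialize(node.body, value, out)
         out[#out + 1] = "}"
      elseif node.kind == "list" then
         -- Zero or more instances of "keyword { configuration... }".
         for _, item in ipairs(value) do
            out[#out + 1] = node.name .. " {"
            serialize(node.body, item, out)
            out[#out + 1] = "}"
         end
      end
   end
   return out
end

local config = {
   active = true,
   routes = { route = { { addr = "1.2.3.4", port = 1 } } },
}
local text = table.concat(serialize(schema, config, {}), " ")
print(text)
```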
An example configuration might be:
active true;
routes {
route { addr 1.2.3.4; port 1; }
route { addr 2.3.4.5; port 10; }
route { addr 3.4.5.6; port 2; }
}
Except in special cases as described in RFC 6020, order is insignificant. You could have active false;
at the end, for example, and route { addr 1.2.3.4; port 1; }
is the same as route { port 1; addr 1.2.3.4; }
.
The surface syntax of our configuration format is the same as for YANG schemas; "1.2.3.4"
is the same as 1.2.3.4
. Snabb follows the XML mapping guidelines of how to represent data described by a YANG schema, except that it uses YANG syntax instead of XML syntax. We could generate XML instead, but we want to avoid bringing in the complexities of XML parsing to Snabb. We also think that the result is a syntax that is pleasant and approachable to write by hand; we want to make sure that everyone can use the same configuration format, regardless of whether they are configuring Snabb via an external daemon like sysrepo
or whether they write configuration files by hand.
Loading a schema and using it to parse a data file can be a bit expensive, especially if the data file includes a large routing table or other big structure. It can be useful to pay for this parsing and validation cost “offline”, without interrupting a running data plane.
For this reason, Snabb supports compiling configurations to binary data. A data plane can load a compiled configuration without any validation, very cheaply. Users can explicitly call the compile_config_for_schema
or compile_config_for_schema_by_name
functions. Support is also planned for automatic compilation of source configuration files, so that the user can just edit configurations as text and still take advantage of the speedy binary configuration loads when nothing has changed.
[TODO] We will need to be able to serialize a configuration back to source, for when a user asks what the configuration of a device is. We will also need to serialize partial configurations, for when the user asks for just a part of the configuration.
[TODO] We will need to support updating the configuration of a running Snabb application. We plan to compile the candidate configuration in a non-worker process, then signal the worker to reload its configuration.
[TODO] We will need to support incremental configuration updates, for example to add or remove a binding table entry for the lwAFTR. In this way we can avoid a full reload of the configuration, minimizing packet loss.
[TODO] We need to map the state data exported by a Snabb process (counters, etc) to YANG-format data. Perhaps this can be done in a similar way as configuration compilation: the configuration facility in the Snabb binary compiles a YANG state data file and periodically updates it by sampling the data plane, and then we re-use the configuration serialization facilities to serialize (potentially partial) state data.
The public entry point to the YANG library is the lib.yang.yang
module, which exports the following bindings. Note that unless you have special needs, probably the only one you want to use is load_configuration
.
— Function load_configuration filename parameters
Load a configuration from disk. If filename is a compiled configuration, load it directly. Otherwise it must be a source file. In that case, try to load a corresponding compiled file instead if possible. If all that fails, actually parse the source configuration, and try to residualize a corresponding compiled file so that we won’t have to go through the whole thing next time.
parameters is a table of key/value pairs. The following key is required:
schema_name
: The name of the YANG schema that describes the configuration. This is the name that appears as the id in module id { ... } in the schema.
Optional entries that may be present in the parameters table include:
verbose
: Set to true to print verbose information about which files are being loaded and compiled.
revision_date
: If set, assert that the loaded configuration was built against this particular schema revision date.
For more information on the format of the returned value, see the documentation below for load_config_for_schema.
— Function load_schema src filename
Load a YANG schema from the string src. filename is an optional file name for use in error messages. Returns a YANG schema object.
Schema objects do have useful internal structure but they are not part of the documented interface.
— Function load_schema_file filename
Load a YANG schema from the file named filename. Returns a YANG schema object.
— Function load_schema_by_name name revision
Load the given named YANG schema. The name indicates the canonical name of the schema, which appears as module *name* { ... }
in the YANG schema itself, or as import *name* { ... }
in other YANG modules that import this module. revision optionally indicates that a certain revision date should be required.
— Function add_schema src filename
Add the YANG schema from the string src to Snabb’s database of YANG schemas, making it available to load_schema_by_name
and related functionality. filename is used when signalling any parse errors. Returns the name of the newly added schema.
— Function add_schema_file filename
Like add_schema
, but reads the YANG schema in from a file. Returns the name of the newly added schema.
— Function load_config_for_schema schema src filename
Given the schema object schema, load the configuration from the string src. Returns a parsed configuration as a plain old Lua value that tries to represent configuration values using appropriate Lua types.
The top-level result from parsing will be a table whose keys are the top-level configuration options. For example, take the configuration from above:
active true;
routes {
route { addr 1.2.3.4; port 1; }
route { addr 2.3.4.5; port 10; }
route { addr 3.4.5.6; port 2; }
}
In this case, the result would be a table with two keys, active
and routes
. The value of the active
key would be Lua boolean true
.
The routes
container is just another table of the same kind.
Inside the routes
container is the route
list, which is represented as an associative array. The particular representation for the associative array depends on characteristics of the list
type; see below for details. In this case the route
list compiles to a ctable
. Therefore to get the port for address 1.2.3.4, you would do:
local yang = require('lib.yang.yang')
local ipv4 = require('lib.protocol.ipv4')
local data = yang.load_config_for_schema(router_schema, conf_str)
local port = data.routes.route:lookup_ptr(ipv4:pton('1.2.3.4')).value.port
assert(port == 1)
Here we see that integer values like the port
leaves are represented directly as Lua numbers, if they fit within the uint32
or int32
range. Integers outside that range are represented as uint64_t
if they are positive, or int64_t
otherwise.
Boolean values are represented using normal Lua booleans, of course.
String values are just parsed to Lua strings, with the normal Lua limitation that UTF-8 data is not decoded. Lua strings look like strings but really they are byte arrays.
There is special support for the ipv4-address
, ipv4-prefix
, ipv6-address
, and ipv6-prefix
types from ietf-inet-types
, and mac-address
from ietf-yang-types
. Values of these types are instead parsed to raw binary data that is compatible with the relevant parts of Snabb’s lib.protocol
facility.
Let us return to the representation of compound configurations, like list
instances. A compound configuration whose shape is fixed is compiled to raw FFI data. A configuration’s shape is determined by its schema. A schema node whose data will be fixed is either a leaf whose type is numeric or boolean and which is either mandatory or has a default value, or a container (leaf-list
, container
, or list
) whose elements are all themselves fixed.
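The rule for fixed shapes is recursive, which a short predicate makes explicit. The sketch below follows the rule only as worded here and makes several assumptions: the schema node layout and the name `is_fixed` are hypothetical, “numeric” is taken to mean the YANG integer types, a leaf-list is judged by its element type alone, and the special handling of address types is ignored.

```lua
-- Types treated as fixed-size in this sketch (an assumption).
local fixed_types = {
   boolean = true,
   int8 = true, int16 = true, int32 = true, int64 = true,
   uint8 = true, uint16 = true, uint32 = true, uint64 = true,
}

-- Hypothetical predicate over simplified schema nodes.
local function is_fixed(node)
   if node.kind == "leaf" then
      -- Numeric or boolean, and always present (mandatory or defaulted).
      return fixed_types[node.type] == true
         and (node.mandatory == true or node.default ~= nil)
   elseif node.kind == "leaf-list" then
      return fixed_types[node.type] == true
   else
      -- container or list: fixed only if every child is fixed.
      for _, child in ipairs(node.body) do
         if not is_fixed(child) then return false end
      end
      return true
   end
end

local port = { kind = "leaf", name = "port", type = "uint8", mandatory = true }
local label = { kind = "leaf", name = "label", type = "string", mandatory = true }
local optional = { kind = "leaf", name = "mtu", type = "uint16" }
local fixed_box = { kind = "container", name = "a", body = { port } }
local mixed_box = { kind = "container", name = "b", body = { port, label } }
```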
In practice this means that a fixed container
will be compiled to an FFI struct
type. This is mostly transparent from the user perspective, as in LuaJIT you access struct members by name in the same way as for normal Lua tables.
A fixed leaf-list
will be compiled to an FFI array of its element type, but on the Lua side is given the normal 1-based indexing and support for the #len
length operator via a wrapper. A non-fixed leaf-list
is just a Lua array (a table with indexes starting from 1).
Instances of list
nodes can have one of several representations. (Recall that in YANG, list
is not a list in the sense that we normally think of it in programming languages, but rather is a kind of hash map.)
If there is only one key leaf, and that leaf has a string type, then a configuration list is represented as a normal Lua table whose keys are the key strings, and whose values are Lua structures holding the leaf values, as in containers. (In fact, it could be that the value of a string-key struct is represented as a C struct, as in raw containers.)
If all key and value types are fixed, then a list
configuration compiles to an efficient ctable
.
If all keys are fixed but values are not, then a list
configuration compiles to a cltable
.
Otherwise, a list
configuration compiles to a Lua table whose keys are Lua tables containing the keys. This sounds good on the surface but really it’s a pain, because you can’t simply look up a value in the table like foo[{key1=42,key2=50}]
, because lookup in such a table is by identity and not by value. Oh well. You can still do for k,v in pairs(foo)
, which is often good enough in this case.
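The identity problem is easy to reproduce in plain Lua, along with the `pairs`-based workaround. The helper `find_entry` is a hypothetical name for this sketch, and it compares only the fields named in the wanted key.

```lua
local foo = {}
foo[{ key1 = 42, key2 = 50 }] = "some entry"

-- A fresh table with equal contents is a different key, so this is nil.
local direct = foo[{ key1 = 42, key2 = 50 }]

-- Hypothetical helper: linear search comparing key fields by value.
local function find_entry(tbl, wanted)
   for k, v in pairs(tbl) do
      local equal = true
      for name, value in pairs(wanted) do
         if k[name] ~= value then
            equal = false
            break
         end
      end
      if equal then return v end
   end
   return nil
end

local found = find_entry(foo, { key1 = 42, key2 = 50 })
```

The search is linear in the number of entries, so it suits small lists or occasional lookups.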
Note that there are a number of value types that are not implemented, including some important ones like union
.
— Function load_config_for_schema_by_name schema_name src filename
Like load_config_for_schema
, but identifying the schema by name instead of by value, as in load_schema_by_name
.
— Function print_config_for_schema schema data file
Serialize the configuration data as text via repeated calls to the write
method of file. At the end, the flush
method is called on file. schema is the schema that describes data.
— Function compile_config_for_schema schema data filename mtime
Compile data, using a compiler generated for schema, and write out the result to the file named filename. mtime, if given, should be a table with secs
and nsecs
keys indicating the modification time of the source file. This information will be serialized in the compiled file, and may be used when loading the file to determine whether the configuration is up to date.
— Function compile_config_for_schema_by_name schema_name data filename mtime
Like compile_config_for_schema
, but identifying the schema by name instead of by value, as in load_schema_by_name
.
— Function load_compiled_data_file filename
Load the compiled data file at filename. If the file is not a compiled YANG configuration, an error will be signalled. The return value will be a table containing four keys:
schema_name
: The name of the schema for which this file was compiled.
revision_date
: The revision date of the schema for which this file was compiled, or the empty string ('') if unknown.
source_mtime
: An mtime table, as for compile_config_for_schema. If no mtime was written into the file, both secs and nsecs will be zero.
data
: The configuration data, in the same format as returned by load_config_for_schema.
The lib.hardware.pci module provides functions that abstract common operations on PCI devices on Linux. In order to drive a PCI device using Direct memory access (DMA) one must:
pci.open_pci_resource_locked
or pci.open_pci_resource_unlocked
.pci.unbind_device_from_linux
.pci.set_bus_master
in order to enable DMA.pci.map_pci_memory
.pci.map_pci_memory
.pci.set_bus_master
.pci.close_pci_resource
.The correct ordering of these steps is absolutely critical.
Users of lib.hardware.pci can rely on steps 6 and 7 being performed automatically in the event of a disorderly shutdown. However, to ensure that bus mastering for the PCI device in use is not disabled due to another worker’s shutdown (see core.worker) they must keep a flock(2) on resource 0. This can be achieved either implicitly via pci.open_pci_resource_locked or by manual calls to flock(2).
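The seven steps can be sketched as a single function. To keep the example runnable outside Snabb, the PCI module is passed in as a parameter (in Snabb it would be the result of require("lib.hardware.pci")). The names `with_pci_device` and `use_device` are hypothetical, and the call signatures follow the function descriptions in this section.

```lua
-- Hypothetical wrapper that performs the seven steps in order.
-- `pci` is the module table, `use_device` is a callback that receives
-- the pointer to the mapped memory.
local function with_pci_device(pci, pciaddress, use_device)
   local fd = pci.open_pci_resource_locked(pciaddress, 0)  -- step 1
   pci.unbind_device_from_linux(pciaddress)                -- step 2
   pci.set_bus_master(pciaddress, true)                    -- step 3
   local ptr = pci.map_pci_memory(fd)                      -- step 4
   local ok, err = pcall(use_device, ptr)                  -- step 5
   pci.set_bus_master(pciaddress, false)                   -- step 6
   pci.close_pci_resource(fd, ptr)                         -- step 7
   -- Re-raise any error only after the device has been shut down.
   if not ok then error(err, 0) end
end
```

Wrapping step 5 in pcall is a design choice of this sketch: it keeps steps 6 and 7 from being skipped when the callback fails.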
— Variable pci.devices
An array of supported hardware devices. Must be populated by calling pci.scan_devices
. Each entry is a table as returned by pci.device_info
.
— Function pci.canonical pciaddress
Returns the canonical representation of a PCI address. The canonical representation is preferred internally in Snabb and for presenting to users. It shortens addresses with leading zeros like this: 0000:01:00.0
becomes 01:00.0
.
— Function pci.qualified pciaddress
Returns the fully qualified representation of a PCI address. Fully qualified addresses have the form 0000:01:00.0
and so this function undoes any abbreviation in the canonical representation.
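The two address forms can be illustrated with Lua string patterns. This is a sketch under the assumption that only the domain prefix 0000: is ever abbreviated. It is not taken from the lib.hardware.pci source.

```lua
-- Strip a leading "0000:" domain, if present.
local function canonical(pciaddress)
   -- The parentheses drop gsub's second return value (the match count).
   return (pciaddress:gsub("^0000:", ""))
end

-- Add the "0000:" domain back to an abbreviated bus:device.function address.
local function qualified(pciaddress)
   return (pciaddress:gsub("^(%x%x:%x%x%.%x+)$", "0000:%1"))
end
```

Both functions leave an address that is already in the target form unchanged, so they can be applied repeatedly.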
— Function pci.scan_devices
Scans for available PCI devices and populates the pci.devices
table.
— Function pci.device_info pciaddress
Returns a table containing information about the PCI device by pciaddress. The table has the following keys:
pciaddress: String denoting the PCI address of the device, e.g. "0000:83:00.1".
vendor: Identification string, e.g. "0x8086" for Intel.
device: Identification string, e.g. "0x10fb" for the 82599 chip.
interface: Name of the Linux interface using this device, e.g. "eth0".
status: String denoting the Linux operational status, or nil if not known.
driver: String denoting the Lua module that supports this hardware, e.g. "apps.intel.intel10g".
usable: String denoting if the device was suitable to use when scanned. One of "yes" or "no".
— Function pci.which_driver vendor, model
Returns the module name for a suitable device driver (if available) for a device of model from vendor.
— Function pci.unbind_device_from_linux pciaddress
Forces Linux to unbind the device identified by pciaddress from any kernel drivers.
— Function pci.set_bus_master pciaddress, enable
Enables or disables PCI bus mastering for device identified by pciaddress depending on whether enable is a true or a false value. PCI bus mastering must be enabled in order to perform DMA on the PCI device.
— Function pci.open_pci_resource_unlocked pciaddress, n
— Function pci.open_pci_resource_locked pciaddress, n
Opens configuration space n of PCI device identified by pciaddress. Returns a file descriptor of the opened sysfs resource file.
The two variants indicate whether the underlying memory mapped file should be exclusively locked with flock(2) or not.
— Function pci.map_pci_memory fd
Memory maps the configuration space of the PCI device identified by fd. Returns a pointer to the memory mapped region. The device must be unbound from Linux and PCI bus mastering must be enabled on the device before calling this function.
— Function pci.close_pci_resource file_descriptor, pointer
Closes memory mapped file_descriptor of sysfs resource file and unmaps it from pointer as returned by pci.map_pci_memory
.
The lib.hardware.register
module provides an abstraction for hardware device registers. This abstraction can be used to declaratively specify and conveniently manipulate structured memory regions via DMA. The functions register.define
and register.define_array
construct Register
objects based on a register description string. The resulting Register
objects can be used to manipulate the defined registers using the methods Register:read
, Register:write
, Register:set
, Register:clr
, Register:wait
and Register:reset
(exact set depends on the register mode).
A register description is a string with one Register
object definition per line. A Register
object definition must be expressed using the following grammar:
Register ::= Name Offset Indexing Mode Longname
Name ::= <identifier>
Indexing ::= "-"
::= "+" OffsetStep "*" Min ".." Max
Mode ::= "RO" | "RW" | "RC" | "RCR" | "RW64" | "RO64" | "RC64" | "RCR64"
Longname ::= <string>
Offset ::= OffsetStep ::= Min ::= Max ::= <number>
A Register object definition is made up of the following properties:
Name: The name of the Register object. Must be a valid Lua identifier, e.g. "foo", "foo_bar", "FOO" etc.
Offset: The offset of the register relative to the base pointer (see register.define and register.define_array).
Indexing: Either "-" for a singular register, or "+" OffsetStep "*" Min ".." Max for a range of registers.
Mode: One of "RO", "RW", "RC", "RCR", "RO64", "RW64", "RC64", "RCR64" standing for read-only, read-write and counter modes in 32 bit and 64 bit variants respectively. Counter mode is for counter registers that clear back to zero when read, RCR is for counters that wrap.
Longname: A free-form string describing the register.
For instance, the following Register object definition defines a register range “TXDCTL” in read-write mode starting at offset 0x06028 with 128 registers each of length 0x40.
TXDCTL 0x06028 +0x40*0..127 RW Transmit Descriptor Control
The next example defines a singular register “TPT” in counter mode located at offset 0x01428.
TPT 0x01428 - RC Total Packets Transmitted
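To make the grammar concrete, the following sketch parses one definition line with Lua patterns and computes the offset of the n-th register in a range. The function names `parse_register_line` and `register_address` are hypothetical. This is not the parser used by lib.hardware.register, and it does not validate the mode.

```lua
-- Parse "Name Offset Indexing Mode Longname" into a table, or return nil.
local function parse_register_line(line)
   local name, offset, indexing, mode, longname =
      line:match("^%s*([%a_][%w_]*)%s+(%S+)%s+(%S+)%s+(%S+)%s+(.-)%s*$")
   if not name then return nil end
   local reg = {name = name, offset = tonumber(offset),
                mode = mode, longname = longname}
   if indexing ~= "-" then
      -- Indexing of the form +OffsetStep*Min..Max
      local step, min, max = indexing:match("^%+(%w+)%*(%d+)%.%.(%d+)$")
      if not step then return nil end
      reg.step, reg.min, reg.max = tonumber(step), tonumber(min), tonumber(max)
   end
   return reg
end

-- Offset of register n within a range (n defaults to 0).
local function register_address(reg, n)
   n = n or 0
   if reg.step then
      assert(n >= reg.min and n <= reg.max, "index out of range")
      return reg.offset + reg.step * n
   end
   return reg.offset
end
```

With the TXDCTL line above, register 2 of the range would sit at 0x06028 + 2 * 0x40.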
— Function register.define description, table, base_pointer, n
Creates Register objects for description relative to base_pointer. The resulting Register objects will become named entries in table using the names defined in description. If an entry in description defines an indexing range then n specifies the index of the register within that range. n defaults to 0.
— Function register.define_array description, table, base_pointer
Creates Register objects for description relative to base_pointer. The resulting Register objects will become named entries in table using the names defined in description. If an entry in description defines an indexing range, an array of Register objects will be created instead of a singular Register object.
— Function register.dump table
Prints a pretty-printed register dump of a table of registers.
— Method Register:read
Returns the value of register. For convenience register objects can be called without arguments instead of calling Register:read
. E.g. reg:read()
is equivalent to reg()
.
— Method Register:write value
Sets the value of register to value. Only available on registers in read-write mode. For convenience register objects can be called with an argument instead of calling Register:write
. E.g. reg:write(value)
is equivalent to reg(value)
.
If register is in counter mode, Register:read assumes that the register is reset to zero upon reading. Each value read is added to a register accumulator and the sum of all reads is returned.
— Method Register:set bitmask
Sets bits of register according to bitmask. Only available on registers in read-write mode.
— Method Register:clr bitmask
Clears bits of register according to bitmask. Only available on registers in read-write mode.
— Method Register:bits offset, length, bits
Get or set length bits at offset in register. Sets length bits at offset in register to bits if bits is supplied. Returns length bits at offset in register otherwise. Setting is only available on registers in read-write mode.
— Method Register:byte offset, byte
Get or set byte at offset in register. Sets byte at offset in register to byte if byte is supplied. Returns byte at offset in register otherwise. Setting is only available on registers in read-write mode.
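The bit-field access described above can be modelled on plain numbers. The sketch below uses arithmetic instead of a bit library so that it runs on any Lua version. It assumes that offset counts from the least significant bit, and the helper names `get_bits` and `set_bits` are hypothetical.

```lua
-- Extract `length` bits starting at bit `offset` (bit 0 = least significant).
local function get_bits(value, offset, length)
   return math.floor(value / 2^offset) % 2^length
end

-- Return `value` with the same bit field replaced by `bits`.
local function set_bits(value, offset, length, bits)
   local cleared = value - get_bits(value, offset, length) * 2^offset
   return cleared + (bits % 2^length) * 2^offset
end
```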
— Method Register:wait bitmask, value
Blocks until applying bitmask to the register equals value. If value is not supplied blocks until all bits in the mask are set instead. Only available on registers in read-write and read-only modes.
— Method Register:reset
Reset the register accumulator to 0. Only available on registers in counter mode.
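The accumulator behaviour of counter mode can be modelled with a closure. In this sketch `read_and_clear` is a hypothetical stand-in for a hardware register that clears on read. The code illustrates the described semantics and is not the library's implementation.

```lua
-- Build a counter object around a clear-on-read source.
local function new_counter(read_and_clear)
   local acc = 0
   local counter = {}
   -- Each read adds the hardware value to the accumulator and
   -- returns the sum of all reads so far.
   function counter:read()
      acc = acc + read_and_clear()
      return acc
   end
   -- Reset only the accumulator, not the hardware.
   function counter:reset()
      acc = 0
   end
   return counter
end
```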
— Method Register:print
Prints the register state to standard output.
The lib.protocol.header
module contains the base class from which the supported protocol classes are derived. It defines generic methods on all protocol subclasses.
— Method header:new_from_mem memory, length
Creates and returns a header object by “overlaying” the respective header structure over length bytes of memory. Returns nil
if length is too small to contain the header.
— Method header:header
Returns the raw header as a cdata object.
— Method header:sizeof
Returns the byte size of header.
— Method header:eq header
Generic equality predicate. Returns true
if header is equal to self and false
otherwise.
— Method header:copy destination, relocate
Copies the header to destination. The caller must ensure that there is enough space at destination. If relocate is a true value, destination is promoted to be the active storage for the header.
— Method header:clone
Returns a copy of the header object.
— Method header:upper_layer
Returns the protocol class that can handle the “upper layer protocol” or nil
if the protocol is not supported or the protocol has no upper layer.
For instance, on an Ethernet header object this method might return an IPv4 or IPv6 header class.
The lib.protocol.ethernet
module contains a class for representing Ethernet headers. The ethernet
protocol class supports two upper layer protocols: lib.protocol.ipv4
and lib.protocol.ipv6
.
— Method ethernet:new config
Returns a new Ethernet header for config. Config must be a table which may contain the following keys:
dst - Destination MAC (binary representation). Default is 00:00:00:00:00:00.
src - Source MAC (binary representation). Default is 00:00:00:00:00:00.
type - Either 0x0800 or 0x86dd for IPv4 or IPv6 respectively. Default is 0x0.
— Method ethernet:src mac
— Method ethernet:dst mac
— Method ethernet:type type
Combined accessor and setter methods. These methods set the values of the source, destination and type fields of an Ethernet header. If no argument is given the current value is returned.
Example:
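local ethernet = require("lib.protocol.ethernet")
-- Build a header with zeroed addresses and the IPv6 ethertype.
local eth = ethernet:new({src = ethernet:pton("00:00:00:00:00:00"),
                          dst = ethernet:pton("00:00:00:00:00:00"),
                          type = 0x86dd})
-- Setter form: overwrite the destination address.
eth:dst(ethernet:pton("54:52:00:01:00:00"))
-- Getter form: read the destination back as a string.
ethernet:ntop(eth:dst()) -- => "54:52:00:01:00:00"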
— Method ethernet:src_eq mac
— Method ethernet:dst_eq mac
Predicate methods to test if mac is equal to the source or destination address, respectively.
— Method ethernet:swap
Swaps the values of the source and destination fields.
— Function ethernet:pton string
Returns the binary representation of MAC address denoted by string.
— Function ethernet:ntop mac
Returns the string representation of mac address.
— Function ethernet:is_mcast mac
Returns a true value if mac address denotes a Multicast address.
— Function ethernet:is_bcast mac
Returns a true value if mac address denotes a Broadcast address.
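The address helpers can be approximated in plain Lua. Note the assumptions: this sketch stores the binary MAC as a 6-byte Lua string, which is a simplification chosen for the example and may differ from the binary representation the ethernet class uses, and the function names are hypothetical.

```lua
-- "54:52:00:01:00:00" -> 6-byte string
local function mac_pton(s)
   local a, b, c, d, e, f =
      s:match("^(%x%x):(%x%x):(%x%x):(%x%x):(%x%x):(%x%x)$")
   assert(a, "malformed MAC address")
   return string.char(tonumber(a, 16), tonumber(b, 16), tonumber(c, 16),
                      tonumber(d, 16), tonumber(e, 16), tonumber(f, 16))
end

-- 6-byte string -> "54:52:00:01:00:00"
local function mac_ntop(mac)
   return string.format("%02x:%02x:%02x:%02x:%02x:%02x", mac:byte(1, 6))
end

-- Multicast addresses have the least significant bit of the first octet set.
local function is_mcast(mac)
   return mac:byte(1) % 2 == 1
end

-- The broadcast address is ff:ff:ff:ff:ff:ff.
local function is_bcast(mac)
   return mac == string.rep("\255", 6)
end
```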
— Function ethernet:ipv6_mcast ip
Returns the MAC address for IPv6 multicast ip as defined by RFC2464, section 7.
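RFC 2464, section 7 maps an IPv6 multicast address to the Ethernet address 33:33 followed by the last four octets of the IPv6 address. A minimal sketch of that mapping follows, assuming the IPv6 address is given as a 16-byte Lua string (a simplification for the example) and using the hypothetical name `ipv6_mcast_mac`.

```lua
-- 16-byte IPv6 multicast address -> 6-byte Ethernet multicast address
local function ipv6_mcast_mac(ip)
   assert(#ip == 16, "expected a 16-byte IPv6 address")
   -- "\51" is the octet 0x33; append the last four octets of the address.
   return "\51\51" .. ip:sub(13, 16)
end
```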
The lib.protocol.ipv4
module contains a class for representing IPv4 headers. The ipv4
protocol class supports four upper layer protocols: lib.protocol.tcp
, lib.protocol.udp
, lib.protocol.gre
and lib.protocol.icmp.header
.
— Method ipv4:new config
Returns a new IPv4 header for config. Config must be a table which may contain the following keys:
dst - Destination IPv4 address (binary representation). Default is 0.0.0.0.
src - Source IPv4 address (binary representation). Default is 0.0.0.0.
protocol - The upper layer protocol, can be 6 (TCP), 17 (UDP), 47 (GRE) or 58 (ICMP). Default is 255.
dscp - “Differentiated Services Code Point” field (6 bit unsigned integer). Default is 0.
ecn - “Explicit Congestion Notification” field (2 bit unsigned integer). Default is 0.
id - “Identification” field (16 bit unsigned integer). Default is 0.
flags - “Don’t Fragment (DF)” and “More Fragments (MF)” fields (3 bit unsigned integer). Default is 0.
frag_off - “Fragment Offset” field (13 bit unsigned integer). Default is 0.
ttl - “Time To Live” field (8 bit unsigned integer). Default is 0.
— Method ipv4:dst ip
— Method ipv4:src ip
— Method ipv4:protocol protocol
— Method ipv4:dscp dscp
— Method ipv4:ecn ecn
— Method ipv4:id id
— Method ipv4:flags flags
— Method ipv4:frag_off frag_off
— Method ipv4:ttl ttl
Combined accessor and setter methods. These methods set the values of the instance fields (see new
) of an IPv4 header. If no argument is given the current value is returned.
— Method ipv4:version version
Combined accessor and setter method for the “Version” field (4 bit unsigned integer). Defaults to 4 (set automatically by new
). Sets the “Version” field to version. If no argument is given the current value is returned.
— Method ipv4:ihl ihl
Combined accessor and setter method for the “Internet Header Length” field (4 bit unsigned integer). Set automatically by new
. Sets the “Internet Header Length” field to ihl. If no argument is given the current value is returned.
— Method ipv4:total_length length
Combined accessor and setter method for the “Total Length” field (16 bit unsigned integer). Defaults to header length (set automatically by new
). Sets the “Total Length” field to length. If no argument is given the current value is returned.
— Method ipv4:checksum
Computes and sets the IPv4 header checksum. It is called automatically by new, but must be called again whenever the header is changed afterwards.
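For reference, the IPv4 header checksum is the standard Internet checksum (RFC 1071): the one's complement of the one's complement sum of all 16 bit words in the header, with the checksum field itself counted as zero. The sketch below computes it over a header given as a Lua string. It is an independent illustration with hypothetical function names, not the code behind ipv4:checksum.

```lua
-- Convert a hex string such as "4500..." into a byte string.
local function from_hex(s)
   return (s:gsub("%x%x", function (h) return string.char(tonumber(h, 16)) end))
end

-- Compute the checksum of an IPv4 header passed as a byte string.
local function ipv4_header_checksum(header)
   assert(#header >= 20 and #header % 2 == 0, "unexpected header size")
   local sum = 0
   for i = 1, #header, 2 do
      local hi, lo = header:byte(i, i + 1)
      -- Bytes 11 and 12 hold the checksum field, which counts as zero.
      if i ~= 11 then sum = sum + hi * 256 + lo end
   end
   -- Fold carries back into the low 16 bits.
   while sum > 0xffff do
      sum = (sum % 0x10000) + math.floor(sum / 0x10000)
   end
   return 0xffff - sum
end

-- A 20-byte sample header whose stored checksum field is 0xb861.
local sample = from_hex("45000073000040004011b861c0a80001c0a800c7")
local checksum = ipv4_header_checksum(sample)
```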
— Method ipv4:dst_eq ip
— Method ipv4:src_eq ip
Predicate methods to test if ip is equal to the source or destination address, respectively.
— Function ipv4:pton string
Returns the binary representation of IPv4 address denoted by string.
— Function ipv4:ntop ip
Returns the string representation of ip address.
The lib.protocol.ipv6
module contains a class for representing IPv6 headers. The ipv6
protocol class supports four upper layer protocols: lib.protocol.tcp
, lib.protocol.udp
, lib.protocol.gre
and lib.protocol.icmp.header
.
— Method ipv6:new config
Returns a new IPv6 header for config. Config must be a table which may contain the following keys:
dst - Destination IPv6 address (binary representation). Default is 0::0.
src - Source IPv6 address (binary representation). Default is 0::0.
traffic_class - “Traffic Class” field (8 bit unsigned integer). Default is 0.
flow_label - “Flow Label” field (20 bit unsigned integer). Default is 0.
next_header - “Next Header” field (8 bit unsigned integer). Default is 0.
hop_limit - “Hop Limit” field (8 bit unsigned integer). Default is 0.
— Method ipv6:dst ip
— Method ipv6:src ip
— Method ipv6:traffic_class traffic_class
— Method ipv6:flow_label flow_label
— Method ipv6:next_header next_header
— Method ipv6:hop_limit hop_limit
Combined accessor and setter methods. These methods set the values of the instance fields (see new
) of an IPv6 header. If no argument is given the current value is returned.
— Method ipv6:version version
Combined accessor and setter method for the version field (4 bit unsigned integer). Defaults to 6 (set automatically by new
). Sets the “Version” field to version. If no argument is given the current value is returned.
— Method ipv6:dscp dscp
Combined accessor and setter method for the “Differentiated Services Code Point” field (6 bit unsigned integer). Default is 0. This is a sub-field of the “Traffic Class” field. Sets the “Differentiated Services Code Point” field to dscp. If no argument is given the current value is returned.
— Method ipv6:ecn ecn
Combined accessor and setter method for the “Explicit Congestion Notification” field (2 bit unsigned integer). Default is 0. This is a sub-field of the “Traffic Class” field. Sets the “Explicit Congestion Notification” field to ecn. If no argument is given the current value is returned.
— Method ipv6:payload_length length
Combined accessor and setter method for the “Payload Length” field (16 bit unsigned integer). Default is 0. Sets the “Payload Length” field to length. If no argument is given the current value is returned.
— Method ipv6:dst_eq ip
— Method ipv6:src_eq ip
Predicate methods to test if ip is equal to the source or destination address, respectively.
— Function ipv6:pton string
Returns the binary representation of IPv6 address denoted by string.
— Function ipv6:ntop ip
Returns the string representation of ip address.
— Function ipv6:solicited_node_mcast ip
Returns the solicited-node multicast address from the given unicast ip.
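A solicited-node multicast address is formed by appending the low-order 24 bits of the unicast address to the prefix ff02:0:0:0:0:1:ff00::/104 (RFC 4291). The sketch below performs that construction on a 16-byte Lua string, which is a simplified stand-in for the binary address representation. The local function is a hypothetical illustration and not the ipv6 class method.

```lua
-- 16-byte unicast address -> 16-byte solicited-node multicast address
local function solicited_node_mcast(ip)
   assert(#ip == 16, "expected a 16-byte IPv6 address")
   -- ff02, nine zero octets, 01, ff, then the last three octets of ip.
   return "\255\2" .. string.rep("\0", 9) .. "\1\255" .. ip:sub(14, 16)
end
```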
The lib.protocol.tcp
module contains a class for representing TCP headers.
— Method tcp:new config
Returns a new TCP header for config. Config must be a table which may contain the following keys:
src_port - “Source Port Number” field (16 bit unsigned integer). Default is 0.
dst_port - “Destination Port Number” field (16 bit unsigned integer). Default is 0.
seq_num - “Sequence Number” field (32 bit unsigned integer). Default is 0.
ack_num - “Acknowledgement Number” field (32 bit unsigned integer). Default is 0.
window_size - “Window Size” field (16 bit unsigned integer). Default is 0.
offset - “Data Offset” field (4 bit unsigned integer). Default is 0.
ns - “NS” flag (1 bit). Default is 0.
cwr - “CWR” flag (1 bit). Default is 0.
ece - “ECE” flag (1 bit). Default is 0.
urg - “URG” flag (1 bit). Default is 0.
ack - “ACK” flag (1 bit). Default is 0.
psh - “PSH” flag (1 bit). Default is 0.
rst - “RST” flag (1 bit). Default is 0.
syn - “SYN” flag (1 bit). Default is 0.
fin - “FIN” flag (1 bit). Default is 0.
— Method tcp:src_port port
— Method tcp:dst_port port
— Method tcp:seq_num seq_num
— Method tcp:ack_num ack_num
— Method tcp:window_size window_size
— Method tcp:offset offset
— Method tcp:ns ns
— Method tcp:cwr cwr
— Method tcp:ece ece
— Method tcp:urg urg
— Method tcp:ack ack
— Method tcp:psh psh
— Method tcp:rst rst
— Method tcp:syn syn
— Method tcp:fin fin
Combined accessor and setter methods. These methods set the values of the instance fields (see new
) of a TCP header. If no argument is given the current value is returned.
— Method tcp:flags flags
Combined accessor and setter method for the TCP header flags (NS, CWR, ECE, URG, ACK, PSH, RST, SYN and FIN). Sets the header’s flags according to flags (9 bit unsigned integer). If no argument is given the current flags are returned.
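The 9 bit flags value follows the layout of the TCP header, with FIN in the least significant bit and NS in the most significant one. The sketch below packs and tests flags using arithmetic only, so it runs on any Lua version. The helper names `pack_flags` and `has_flag` are hypothetical and not part of the tcp class.

```lua
-- Bit position of each flag within the 9 bit flags value.
local FLAG_BITS = {fin = 0, syn = 1, rst = 2, psh = 3, ack = 4,
                   urg = 5, ece = 6, cwr = 7, ns = 8}

-- Build a flags value from a set such as {syn = true, ack = true}.
local function pack_flags(set)
   local flags = 0
   for name, bit in pairs(FLAG_BITS) do
      if set[name] then flags = flags + 2^bit end
   end
   return flags
end

-- Test whether a named flag is set in a flags value.
local function has_flag(flags, name)
   return math.floor(flags / 2^FLAG_BITS[name]) % 2 == 1
end
```

For example, a SYN/ACK segment corresponds to the value 0x12.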
— Method tcp:checksum payload, length, ip
Computes and sets the “Checksum” field for length bytes of payload and optionally ip. If no argument is given the current value of the “Checksum” field is returned.
The lib.protocol.udp
module contains a class for representing UDP headers.
— Method udp:new config
Returns a new UDP header for config. Config must be a table which may contain the following keys:
src_port - “Source Port Number” field (16 bit unsigned integer). Default is 0.
dst_port - “Destination Port Number” field (16 bit unsigned integer). Default is 0.
— Method udp:src_port port
— Method udp:dst_port port
Combined accessor and setter methods for the source and destination port fields. Sets the source or destination port respectively. Returns the current port if called without arguments.
— Method udp:length length
Combined accessor and setter method for the “Length” field. Sets the “Length” field to length (a 16 bit unsigned integer). If no argument is given the current value of the “Length” field is returned. Default is 8 (the UDP header length).
— Method udp:checksum payload, length, ip
Computes and sets the “Checksum” field for length bytes of payload and optionally ip. If no argument is given the current value of the “Checksum” field is returned.
The lib.protocol.gre
module contains a class for representing GRE headers. The gre
protocol class only supports the checksum and key extensions and the lib.protocol.ethernet
upper layer protocol.
— Method gre:new config
Returns a new GRE header for config. Config must be a table which may contain the following keys:
protocol - Upper layer protocol. May be 0x6558 (Ethernet). Default is nil.
checksum - Set to true to enable checksumming. Default is false.
key - 32 bit unsigned integer. Enables keying if supplied. Default is nil.
— Method gre:checksum payload, length
Combined accessor and setter method for the checksum field. Computes and sets the checksum field for length bytes of payload. If no argument is given the current checksum is returned. Returns nil
if checksumming is disabled.
— Method gre:checksum_check payload, length
Predicate to verify length bytes of payload against the header checksum. Returns nil if checksumming is disabled.
— Method gre:key key
Combined accessor and setter method for the key field. Sets the key field to key. If no argument is given the current key is returned. Returns nil
if keying is disabled.
— Method gre:protocol protocol
Combined accessor and setter method for the upper layer protocol. Sets the upper layer protocol to protocol. If no argument is given the current upper layer protocol is returned.
The lib.protocol.icmp.header
module contains a class for representing ICMP headers. The icmp
protocol class currently supports two upper layer protocols: lib.protocol.icmp.nd.ns
and lib.protocol.icmp.nd.na
. These upper layer protocols implement the headers necessary to perform “Neighbor Discovery”.
— Method icmp:new type, code
Returns a new ICMP header of type which may be either 135 or 136 for lib.protocol.icmp.nd.ns
or lib.protocol.icmp.nd.na
respectively. Optionally code can be supplied to set the “Code” field for the type.
— Method icmp:type type
— Method icmp:code code
Combined accessor and setter methods. These methods set the values of the instance fields (see new
) of an ICMP header. If no argument is given the current value is returned.
— Method icmp:checksum payload, length, ipv6
Computes and sets the “Checksum” field for length bytes of payload. If the lower protocol layer is lib.protocol.ipv6
then ipv6 must be set to a true value.
— Method icmp:checksum_check payload, length, ipv6
Predicate to test if the header’s “Checksum” field matches length bytes of payload. If the lower protocol layer is lib.protocol.ipv6
then ipv6 must be set to a true value.
— Method ns:new target
Returns a new Neighbor Solicitation header. Target is the IP address used for the “Target Address” field.
— Method ns:target target
Combined accessor and setter method for the “Target Address” field. Sets the “Target Address” field to target. If no argument is given the current value is returned.
— Method ns:target_eq target
Predicate to test if the header’s value in the “Target Address” field is equivalent to target.
— Method na:new target, router, solicited, override
Returns a new Neighbor Advertisement header. Target is the IP address used for the “Target Address” field. Router, solicited and override can be boolean values to set the “Router”, “Solicited” and “Override” flags respectively. The default for the flags is 0.
— Method na:target target
— Method na:router router
— Method na:solicited solicited
— Method na:override override
Combined accessor and setter methods. These methods set the values of the instance fields (see new) of a Neighbor Advertisement header. If no argument is given the current value is returned.
— Method na:target_eq target
Predicate to test if the header’s value in the “Target Address” field is equivalent to target.
Both Neighbor Solicitation and Advertisement (lib.protocol.icmp.nd.ns and lib.protocol.icmp.nd.na) headers implement an options method for parsing TLV Options contained in their payloads.
Example:
-- Parse datagram with ICMP/NA packet
local na = dgram:parse()
-- Parse TLV Options
local options = na:options(dgram:payload())
— Method nd:options payload, length
Parses and returns an array of TLV Options (see lib.protocol.icmp.nd.options.tlv
) from length bytes of payload.
The lib.protocol.icmp.nd.options.tlv
module contains a class for representing TLV Options. Currently only two types of options are implemented: “Source Link-Layer Address” ("src_ll_addr"
) and “Target Link-Layer Address” ("tgt_ll_address"
). Both are represented by the lladdr
class (see lib.protocol.icmp.nd.options.lladdr
).
— Method tlv:new type, data
Returns a new TLV Option object for data of type. Type may be either 1 for “Source Link-Layer Address” or 2 for “Target Link-Layer Address”. Data must be a lladdr
object.
— Method tlv:name
Returns a string denoting the type of the option. Either "src_ll_addr"
for “Source Link-Layer Address” or "tgt_ll_address"
for “Target Link-Layer Address”.
— Method tlv:length
Returns the size of the TLV Option as multiples of 8 bytes.
— Method tlv:type type
Combined accessor and setter method. Sets the type field (see new
) to type. If no argument is given the current value of the type field is returned.
— Method tlv:option
Returns an object of the class denoted by the type field. Currently that only includes lladdr
instances.
The lib.protocol.icmp.nd.options.lladdr
module contains a class for representing Link-Layer Address Options.
— Method lladdr:new address
Returns a new Link-Layer Option object for MAC address in binary representation.
— Method lladdr:name
Returns the string "ll_addr"
.
— Method lladdr:addr address
Combined accessor and setter method. Sets the address field (see new
) to address. If no argument is given the current value of the address field is returned.
The lib.protocol.datagram
module provides basic mechanisms for parsing, building and manipulating a hierarchy of protocol headers and the associated payload contained in a data packet. In particular, it supports:
It mediates between packets as defined in core.packet
and protocol classes which are defined as classes derived from the protocol header base class in the lib.protocol.header
module.
The contents of a datagram instance are logically divided into three areas: The payload, parsed headers and pushed headers. The datagram payload is a sequence of bytes either inherited from the packet given to datagram:new
or appended using datagram:payload
. The headers in the payload can be parsed using datagram:parse_match
, which will shrink the payload by the header. Finally, synthetic headers can be prepended to the datagram using datagram:push
. To get the whole datagram as a packet use datagram:packet
.
A datagram can be used in two modes of operation, called “immediate commit” and “delayed commit”. In immediate commit mode, the push and pop methods immediately modify the underlying packet. However, this can be undesirable.
Even though these manipulations are relatively fast, using SIMD instructions to move and copy data when possible, performance-aware applications usually try to avoid as many of them as possible. This creates a conflict if the caller pushes or parses a sequence of protocol headers in immediate commit mode, because every single operation then modifies the packet.
This problem can be avoided by using delayed commit mode. In this mode, the push
methods add the data to a separate buffer as intermediate storage. The buffer is prepended to the actual packet in a single operation by calling datagram:commit
.
The pop
methods are made light-weight in delayed commit mode as well by keeping track of an additional offset that indicates where the actual packet starts in the packet buffer. Each call to one of the pop
methods simply increases the offset by the size of the popped piece of data. The accumulated actions will be applied as a single operation by datagram:commit
.
The push
and pop
methods can be freely mixed in delayed commit mode.
Due to the destructive nature of these methods in immediate commit mode, they cannot be applied when the parse stack is not empty, because moving the data in the packet buffer will invalidate the parsed headers. The push
and pop
methods will raise an error in that case.
The buffer used in delayed commit mode has a fixed size of 512 bytes. This limits the size of data that can be pushed in a single operation. A sequence of push/commit operations can be used to push an arbitrary amount of data in chunks of up to 512 bytes.
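The bookkeeping behind delayed commit mode can be illustrated with a toy model. In the sketch below, Lua strings stand in for packet memory, `Toy` is a hypothetical class, and only the raw push, pop and commit operations are modelled. It is not the datagram implementation, and it assumes that pops always apply to the bytes of the underlying packet.

```lua
local PUSH_BUFFER_SIZE = 512  -- fixed size of the intermediate buffer

local Toy = {}
Toy.__index = Toy

function Toy.new(packet)
   return setmetatable({packet = packet, pushed = "", offset = 0}, Toy)
end

-- Prepend bytes to the intermediate buffer; the packet is not touched.
function Toy:push_raw(bytes)
   assert(#self.pushed + #bytes <= PUSH_BUFFER_SIZE,
          "push buffer full, commit first")
   self.pushed = bytes .. self.pushed
end

-- Popping only advances the offset at which the packet logically starts.
function Toy:pop_raw(length)
   assert(self.offset + length <= #self.packet, "not enough data to pop")
   self.offset = self.offset + length
end

-- Apply all accumulated pushes and pops in a single operation.
function Toy:commit()
   self.packet = self.pushed .. self.packet:sub(self.offset + 1)
   self.pushed, self.offset = "", 0
end
```

In this model, popping three bytes and pushing two headers leaves the packet untouched until commit, which then rebuilds it once.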
— Method datagram:new packet, protocol, options
Creates a datagram for packet or from scratch if packet is nil
. Protocol will be used by parse_match
to parse the packet payload. If protocol is not nil
it is set as the initial upper layer protocol. If options is not nil
it must be a table that selects configurable properties of the class. Currently, the only option is the selection of immediate or delayed commit mode by setting the key delayed_commit
to false
or true
, respectively. The default is immediate commit mode.
— Method datagram:push header
Prepends header to the front of the datagram. This method destructively modifies the underlying packet in immediate commit mode and raises an error if the parse stack is not empty.
In delayed commit mode, header is prepended to an intermediate buffer.
— Method datagram:push_raw data, length
This method behaves like the datagram:push method for an arbitrary chunk of memory of length length located at the address pointed to by data.
— Method datagram:parse_match protocol, check
Attempts to parse the next header in the datagram, thereby removing it from the payload. Returns a header instance of class protocol on success. If protocol is nil
the current upper layer protocol as set by datagram:new
or previous calls to parse_match
is used.
If neither protocol nor the upper layer protocol is set or the constructor of the protocol class returns nil
, the parsing operation has failed and parse_match
returns nil
. The datagram remains unchanged.
If the protocol class instance has been created successfully, it is passed as single argument to the anonymous function check.
If check returns a false value, the parsing has failed and parse_match
returns nil
. The packet remains unchanged.
If check is not supplied or if it returned a true value, the parsing has succeeded and the current upper layer protocol of the datagram is set to the value returned by header:upper_layer
.
— Method datagram:parse protocols_and_checks
A wrapper around parse_match
that allows parsing of a sequence of headers with a single method call.
If protocols_and_checks is a sequence of protocol class and check function pairs, parse_match
is called for each pair. Returns the header object of the last header parsed or nil
if any of the calls to parse_match
return nil
.
If called with a nil
argument, this method is equivalent to parse_match
called without arguments.
— Method datagram:parse_n n
A wrapper around parse_match that parses the next n protocol headers using the current upper layer protocol and subsequent values of header:upper_layer. It returns the last header object or nil if fewer than n headers could be parsed successfully.
— Method datagram:unparse n
Undoes the last n calls to parse_match on the datagram, i.e. prepends the n most recently parsed headers back to the payload. The sequence of parsed headers can be obtained by calling stack.
— Method datagram:pop n
Removes the leading n parsed headers from the datagram. Note that headers added via push
can not be removed using pop
. The caller has to ensure that the datagram contains at least n headers that were parsed using parse_match
. The sequence of parsed headers can be obtained by calling stack
. This method destructively modifies the underlying packet in immediate commit mode and raises an error if the parse stack is not empty.
In delayed commit mode, the packet is not modified and the parse stack remains valid.
For instance, let d be a datagram with an Ethernet header followed by an IPv6 header. Assuming we have parsed both headers using d:parse_n(2), we could call d:pop(1) to decapsulate the IPv6 packet from its Ethernet header.
— Method datagram:pop_raw length, ulp
Removes length bytes from the beginning of the datagram. If ulp is given it is set as the current upper layer protocol. This method destructively modifies the underlying packet in immediate commit mode and raises an error if the parse stack is not empty.
In delayed commit mode, the packet is not modified and the parse stack remains valid.
— Method datagram:stack
Returns the parsed header objects as a sequence.
— Method datagram:packet
Returns a packet (see core.packet
) containing the datagram (including pushed headers).
— Method datagram:payload pointer, length
Combined payload accessor and setter method. Returns a pointer to the datagram payload and its byte size.
If pointer and length are supplied then length bytes starting from pointer are appended to the datagram’s payload.
— Method datagram:data
Returns data
and length
of the underlying packet.
— Method datagram:commit
If called in delayed commit mode, the operations accumulated by the push
and pop
methods since the creation of the datagram or the last invocation of datagram:commit are committed to the underlying packet. An error is raised if the parse stack is not empty.
The method can be safely called in immediate commit mode.
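The Ethernet/IPv6 decapsulation example given for pop can be sketched as a small function. This is a minimal sketch, not code from the library: `decapsulate_ipv6` is a hypothetical name, and `d` is assumed to be a datagram object whose initial upper layer protocol is Ethernet. Only the parse_n, pop and packet methods described above are used.

```lua
-- Hypothetical helper: strip the Ethernet header from a datagram that
-- carries an Ethernet header followed by an IPv6 header.
local function decapsulate_ipv6 (d)
   -- Parse two headers (Ethernet, then IPv6). parse_n returns nil if
   -- fewer than two headers could be parsed.
   local ipv6_header = d:parse_n(2)
   if not ipv6_header then
      return nil
   end
   -- Remove the leading parsed header, i.e. the Ethernet header.
   d:pop(1)
   -- Hand back the packet that contains the datagram.
   return d:packet()
end
```

Returning nil on a failed parse keeps the caller in charge of what happens to packets that are not Ethernet/IPv6.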
The lib.ipsec.esp
module contains two classes encrypt
and decrypt
which implement packet encryption and decryption with IPsec ESP in both tunnel and transport modes. Currently, the only supported cipher is AES-GCM with 128‑bit keys, 4 bytes of salt, and a 16 byte authentication code. These classes do not implement any key exchange protocol.
Note: the classes in this module do not reject IP fragments of any sort.
References:
— Method encrypt:new config
— Method decrypt:new config
Returns a new encryption/decryption context respectively. config must be a table with the following keys:
aead - AEAD identifier (string). The only accepted value is "aes-gcm-16-icv" (AES-GCM with a 16 byte ICV).
spi - A 32 bit integer denoting the “Security Parameters Index” as specified in RFC 4303.
key - Hexadecimal string of 32 digits (two digits for each byte, most significant digit first) that denotes 16 bytes of high-entropy key material as specified in RFC 4106.
salt - Hexadecimal string of eight digits (two digits for each byte) that denotes four bytes of salt as specified in RFC 4106.
window_size - Optional. Minimum width of the window in which out of order packets are accepted as specified in RFC 4303. The default is 128. (decrypt only.)
resync_threshold - Optional. Number of consecutive packets allowed to fail decapsulation before attempting “Re-synchronization” as specified in RFC 4303. The default is 1024. (decrypt only.)
resync_attempts - Optional. Number of attempts to re-synchronize a packet that triggered “Re-synchronization” as specified in RFC 4303. The default is 8. (decrypt only.)
auditing - Optional. A boolean value indicating whether to enable or disable “Auditing” as specified in RFC 4303. The default is nil (no auditing). (decrypt only. Note: source address, destination address and flow ID are only logged when using decapsulate_transport6.)
In tunnel mode, encapsulation accepts packets of any format and wraps them in an ESP frame, encrypting the original packet contents. Decapsulation reverses the process: it removes the ESP frame and returns the original input packet.
— Method encrypt:encapsulate_tunnel packet, next_header
Encapsulates packet and encrypts its payload. The ESP header’s Next Header field is set to next_header. Takes ownership of packet and returns a new packet.
— Method decrypt:decapsulate_tunnel packet
Decapsulates packet and decrypts its payload. On success, takes ownership of packet and returns a new packet and the value of the ESP header’s Next Header field. Otherwise returns nil
.
In transport mode, encapsulation accepts IPv6 packets and inserts a new ESP header between the outer IPv6 header and the inner protocol header (e.g. TCP, UDP, L2TPv3) and also encrypts the contents of the inner protocol header. Decapsulation does the reverse: it decrypts the inner protocol header and removes the ESP protocol header. In this mode it is expected that an Ethernet header precedes the outer IPv6 header.
— Method encrypt:encapsulate_transport6 packet
Encapsulates packet and encrypts its payload. On success, takes ownership of packet and returns a new packet. Otherwise returns nil
.
— Method decrypt:decapsulate_transport6 packet
Decapsulates packet and decrypts its payload. On success, takes ownership of packet and returns a new packet. Otherwise returns nil
.
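The key and salt formats described for the config table are easy to get wrong, so it can help to check them before creating a context. The following is a sketch only: `is_hex` and `check_esp_config` are hypothetical helpers that are not part of lib.ipsec.esp, and the key material shown is a placeholder.

```lua
-- Hypothetical helper: true if s is a string of exactly `digits`
-- hexadecimal digits.
local function is_hex (s, digits)
   return type(s) == "string"
      and #s == digits
      and s:match("^%x+$") ~= nil
end

-- Hypothetical helper: returns conf if the mandatory keys look valid,
-- otherwise nil and an error message.
local function check_esp_config (conf)
   if conf.aead ~= "aes-gcm-16-icv" then
      return nil, "aead must be aes-gcm-16-icv"
   end
   local spi = conf.spi
   if type(spi) ~= "number" or spi % 1 ~= 0 or spi < 0 or spi > 0xffffffff then
      return nil, "spi must be a 32 bit integer"
   end
   if not is_hex(conf.key, 32) then
      return nil, "key must be 32 hexadecimal digits"
   end
   if not is_hex(conf.salt, 8) then
      return nil, "salt must be 8 hexadecimal digits"
   end
   return conf
end

-- Placeholder values for illustration; never use them for real traffic.
local conf = assert(check_esp_config{
   aead = "aes-gcm-16-icv",
   spi = 0x1001,
   key = "00112233445566778899aabbccddeeff",
   salt = "00112233"
})
```

Inside a Snabb program, a table that passes this check would presumably be handed to the `new` method of the encrypt or decrypt class described above.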
The program.snabbnfv.nfvconfig
module implements a Network Functions Virtualization component based on Snabb. It introduces a simple configuration file format to describe NFV configurations which it then compiles to app networks. This NFV component is compatible with OpenStack Neutron.
— Function nfvconfig.load file, pci_address, socket_path
Loads the NFV configuration from file and compiles an app network using pci_address and socket_path for the underlying NIC driver and VhostUser
apps. Returns the resulting engine configuration.
The configuration file format understood by program.snabbnfv.nfvconfig
is based on Lua expressions. Initially, it contains a list of NFV ports:
return { <port-1>, ..., <port-n> }
Each port is defined by a range of properties which correspond to the configuration parameters of the underlying apps (NIC driver, VhostUser
, PcapFilter
, RateLimiter
, nd_light
and SimpleKeyedTunnel
):
port := { port_id = <id>, -- A unique string
mac_address = <mac-address>, -- MAC address as a string
vlan = <vlan-id>, -- ..
ingress_filter = <filter>, -- A pcap-filter(7) expression
egress_filter = <filter>, -- ..
tunnel = <tunnel-conf>,
crypto = <crypto-conf>,
rx_police = <n>, -- Allowed input rate in Gbps
tx_police = <n> } -- Allowed output rate in Gbps
The tunnel
section deviates a little from SimpleKeyedTunnel
’s terminology:
tunnel := { type = "L2TPv3", -- The only type (for now)
local_cookie = <cookie>, -- As for SimpleKeyedTunnel
remote_cookie = <cookie>, -- ..
next_hop = <ip-address>, -- Gateway IP
local_ip = <ip-address>, -- ~ `local_address'
remote_ip = <ip-address>, -- ~ `remote_address'
session = <32bit-int> } -- ~ `session_id'
The crypto
section allows configuration of traffic encryption based on apps.ipsec.esp
:
crypto := { type = "esp-aes-128-gcm", -- The only type (for now)
spi = <spi>, -- As for apps.ipsec.esp
transmit_key = <key>,
transmit_salt = <salt>,
receive_key = <key>,
receive_salt = <salt>,
auditing = <boolean> }
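As an illustration of the port grammar above, a small configuration file could look like the following. All values are placeholders, and the sketch assumes that keys which are left out (here tunnel and crypto, and for the second port also the filters and rate limits) are optional.

```lua
-- Hypothetical NFV configuration with two ports.
return {
   { port_id = "A",
     mac_address = "52:54:00:00:00:01",
     vlan = 10,
     ingress_filter = "ip6 and tcp port 80",   -- pcap-filter(7) expression
     egress_filter = "ip6",
     rx_police = 1,                            -- Gbps
     tx_police = 1 },                          -- Gbps
   { port_id = "B",
     mac_address = "52:54:00:00:00:02",
     vlan = 20 }
}
```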
The snabbnfv traffic
program loads and runs an NFV configuration using program.snabbnfv.nfvconfig
. It can be invoked like so:
./snabb snabbnfv traffic <file> <pci-address> <socket-path>
snabbnfv traffic
runs the loaded configuration indefinitely and automatically reloads the configuration file if it changes (at most once every second).
The snabbnfv neutron2snabb
program converts Neutron database CSV dumps to the format used by program.snabbnfv.nfvconfig
. For more info see Snabb NFV Architecture. It can be invoked like so:
./snabb snabbnfv neutron2snabb <csv-directory> <output-directory> [<hostname>]
snabbnfv neutron2snabb
reads the Neutron database CSV dumps in csv-directory and translates them to one lib.nfv.config
configuration file per physical network. If hostname is given, it overrides the hostname provided by hostname(1)
.
Snabb Switch program for overlaying Ethernet networks on the IPv6 Internet or a local IPv6 network. For transporting L2 networks over the Internet, LISPER requires the use of external LISP (RFC 6830) controllers.
LISPER transports L2 networks over an IPv6 network by connecting together Ethernet networks and L2TPv3 point-to-point tunnels that are on different locations on the transport network.
Each location runs an instance of LISPER and an instance of a LISP controller to which multiple network interfaces can be connected.
Some of the interfaces can connect to physical Ethernet networks, others can connect to IPv6 networks (routed or not). The IPv6 interfaces carry packets to/from L2TPv3 tunnels and to/from remote LISPER instances. The same IPv6 interface can connect to multiple tunnels and/or LISPER instances, so a single interface is sufficient to connect everything at one location unless there are also directly attached Ethernet networks, which require separate interfaces.
LISPER can work with any Linux eth interface via raw sockets, or it can use its built-in Intel10G driver to work with Intel 82599 network cards directly. The Intel10G driver also supports 802.1Q, which allows multiple virtual interfaces to be configured on a single network card.
https://github.com/capr/snabbswitch/archive/master.zip
make
Tested on Ubuntu 14.04 and NixOS 15.09.
cd src/program/lisper/dev-env
./net-bringup # create a test network and start everything
./ping-all # run ping tests
./net-bringdown # kill everything and clean up
NOTE: The test network creates network namespaces r2
and nodeN
where N=01..08
so make sure you don’t use these namespaces already.
src/snabb lisper -c <config.file>
The config file is a JSON file that looks like this:
{
"control_sock" : "/var/tmp/lisp-ipc-map-cache04",
"punt_sock" : "/var/tmp/lispers.net-itr04",
"arp_timeout" : 60, // seconds
"interfaces": [
{ "name": "e0", "mac": "00:00:00:00:01:04",
"pci": "0000:05:00.0", "vlan_id": 2 },
{ "name": "e03", "mac": "00:00:00:00:01:03" },
{ "name": "e13", "mac": "00:00:00:00:01:13" }
],
"exits": [
{ "name": "e0", "ip": "fd80:4::2", "interface": "e0",
"next_hop": "fd80:4::1" }
],
"lispers": [
{ "ip": "fd80:8::2", "exit": "e0" }
],
"local_networks": [
{ "iid": 1, "type": "L2TPv3", "ip": "fd80:1::2", "exit": "e0",
"session_id": 1, "cookie": "" },
{ "iid": 1, "type": "L2TPv3", "ip": "fd80:2::2", "exit": "e0",
"session_id": 2, "cookie": "" },
{ "iid": 1, "interface": "e03" },
{ "iid": 1, "interface": "e13" }
]
}
Connectivity with the LISP controller requires control_sock
and punt_sock
, two named sockets that must be the same sockets that the LISP controller was configured with. These can be skipped if there’s no LISP controller.
interfaces
is an array defining the physical interfaces. name
and mac
are required. If pci
is given, the Intel10G driver is used. If vlan_id
is given, the interface is assumed to be an 802.1Q trunk.
exits
is an array defining the IPv6 exit points (if any) which are used for connecting to remote LISPER instances and to L2TPv3 tunnels. name
, ip
, interface
, next_hop
are all required fields.
lispers
is an array defining remote LISPER instances, if any. ip
and exit
are required.
local_networks
is an array defining the local L2 networks connected to this LISPER instance. These can be either local networks (in which case only interface
is required) or L2TPv3 end-points (in which case type
must be “L2TPv3”, and ip
, session_id
, cookie
and exit
are required).
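The required fields listed above can be checked mechanically once the JSON file has been decoded into a Lua table. The following is a sketch under that assumption: `missing_fields` is a hypothetical helper and not part of LISPER, and decoding the JSON is left to whatever JSON library is at hand.

```lua
-- Required fields per section, as described in the text above.
local required = {
   interfaces = {"name", "mac"},
   exits      = {"name", "ip", "interface", "next_hop"},
   lispers    = {"ip", "exit"},
}

-- Hypothetical helper: returns a sorted list of missing required fields,
-- each formatted as "section[index].field".
local function missing_fields (conf)
   local missing = {}
   local function check (section, i, entry, fields)
      for _, field in ipairs(fields) do
         if entry[field] == nil then
            table.insert(missing, ("%s[%d].%s"):format(section, i, field))
         end
      end
   end
   for section, fields in pairs(required) do
      for i, entry in ipairs(conf[section] or {}) do
         check(section, i, entry, fields)
      end
   end
   -- local_networks entries are either L2TPv3 end-points or plain
   -- local networks, with different required fields.
   for i, net in ipairs(conf.local_networks or {}) do
      if net.type == "L2TPv3" then
         check("local_networks", i, net, {"ip", "session_id", "cookie", "exit"})
      else
         check("local_networks", i, net, {"interface"})
      end
   end
   table.sort(missing)
   return missing
end
```

An empty result means every entry carries the fields the text marks as required; it says nothing about whether the values themselves are valid.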
cd src/program/lisper/dev-env
./net-bringup # create a test network and start everything
./net-bringup-intel10g # create a test network using Intel10G cards
./ping-all # run ping tests
./nsnode N # get a shell in the network namespace of a node
./nsr2 # get a shell in the network namespace of R2
./net-teardown # kill everything and clean up
NOTE: net-bringup-intel10g
requires 4 network cards with loopback cables between cards 1,2 and 3,4. Edit the script to set their names and PCI addresses and also edit lisperXX.conf.intel10g
config files and change the pci
and vlan_id
fields as needed. You can find the PCI addresses of the cards in your machine with lspci | grep 82599
.
./ping-all
sends 2000 IPv4 pings of 1000 bytes each between various nodes. Its output should look like this:
l2tp-l2tp
2000 packets transmitted, 2000 received, 0% packet loss, time 443ms
2000 packets transmitted, 2000 received, 0% packet loss, time 603ms
l2tp-eth
2000 packets transmitted, 2000 received, 0% packet loss, time 358ms
2000 packets transmitted, 2000 received, 0% packet loss, time 502ms
eth-l2tp
2000 packets transmitted, 2000 received, 0% packet loss, time 354ms
2000 packets transmitted, 2000 received, 0% packet loss, time 507ms
l2tp-lisper-l2tp
2000 packets transmitted, 2000 received, 0% packet loss, time 1026ms
2000 packets transmitted, 2000 received, 0% packet loss, time 1037ms
eth-lisper-eth
2000 packets transmitted, 2000 received, 0% packet loss, time 926ms
2000 packets transmitted, 2000 received, 0% packet loss, time 876ms
The test network is comprised of multiple network nodes that are all connected to an R2 IPv6 router. The nodes are in different network namespaces and are assigned IPs in different IPv6 subnets to simulate physical locations.
Node namespaces are named nodeXX
where XX is 01, 02, 04, 05, 06 and 08. The router lives in the r2
namespace.
Nodes 01, 02, 05, 06 each contain both endpoints of an L2TPv3 tunnel.
Nodes 04, 08 each contain one LISPER instance and one local Ethernet network.
Each node has at least one interface in the L2 overlay network with IP address 10.0.0.N/24. You should be able to ping between any of them (see ping-all
).
Note the speed differences between nodes. The worst case is if you go to node 01 (which contains 10.0.0.1, which is on an L2TPv3 tunnel) and from there ping 10.0.0.5 (which is itself on an L2TPv3 tunnel on a remote LISPER).
arp_timeout
config option is not followed.
Example Snabb program for prototyping multi-process YANG-based network functions.
The lib.ptree
facility in Snabb allows network engineers to build a network function out of a tree of processes described by a YANG schema. The root process runs the management plane, and the leaf processes (the “workers”) run the data plane. The apps and links in the workers are declaratively created as a function of a YANG configuration.
This snabb ptree
program is a tool to allow quick prototyping of network functions using the ptree facilities. The invocation syntax of snabb ptree
is as follows:
snabb ptree [OPTION...] SCHEMA.YANG SETUP.LUA CONF
The schema.yang file contains a YANG schema describing the network function’s configuration. setup.lua defines a Lua function mapping a configuration to apps and links for a set of worker processes. conf is the initial configuration of the network function.
Let’s say we’re going to make a packet filter application. We can use Snabb’s built-in support for filters expressed in pflang, the language used by tcpdump
, and just hook that filter up to a full-duplex NIC.
To begin with, we have to think about how to represent the configuration of the network function. If we simply want to be able to specify the PCI device of a NIC, an RSS queue, and a filter string, we could describe it with a YANG schema like this:
module snabb-pf-v1 {
namespace snabb:pf-v1;
prefix pf-v1;
leaf device { type string; mandatory true; }
leaf rss-queue { type uint8; default 0; }
leaf filter { type string; default ""; }
}
We throw this into a file pf-v1.yang
. In YANG, a module
’s body contains configuration declarations, most importantly leaf
, container
, and list
. In our snabb-pf-v1
schema, there is a module
containing three leaf
s: device
, rss-queue
, and filter
. Snabb effectively generates a validating parser for configurations following this YANG schema; a configuration file must contain exactly one device FOO;
declaration and may contain one rss-queue
statement and one filter
statement. Thus a concrete configuration following this YANG schema might look like this:
device 83:00.0;
rss-queue 0;
filter "tcp port 80";
So let’s just drop that into a file pf-v1.cfg
and use that as our initial configuration.
Now we just need to map from this configuration to app graphs in some set of workers. The setup.lua file should define this function.
-- Function taking a snabb-pf-v1 configuration and
-- returning a table mapping worker ID to app graph.
return function (conf)
-- Write me :)
end
The conf
parameter to the setup function is a Lua representation of config data for this network function. In our case it will be a table containing the keys device
, rss_queue
, and filter
. (Note that Snabb’s YANG support maps dashes to underscores for the Lua data, so it really is rss_queue
and not rss-queue
.)
The return value of the setup function is a table whose keys are “worker IDs”, and whose values are the corresponding app graphs. A worker ID can be any Lua value, for example a number or a string or whatever. If the user later reconfigures the network function (perhaps setting a different filter string), the manager will re-run the setup function to produce a new set of worker IDs and app graphs. The manager will then stop workers whose ID is no longer present, start new workers, and reconfigure workers whose ID is still present.
In our case we’re just going to have one worker, so we can use any worker ID. If the user reconfigures the filter but keeps the same device and RSS queue, we don’t want to interrupt packet flow, so we want to use a worker ID that won’t change. But if the user changes the device, probably we do want to restart the worker, so maybe we make the worker ID a function of the device name.
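The effect of the worker ID choice can be illustrated with a small comparison of two results of the setup function. This is an illustration of the behaviour described above and not the lib.ptree implementation; `plan_update` is a hypothetical name, and the strings stand in for app graphs.

```lua
-- Hypothetical helper: classify worker IDs given the tables returned by
-- the previous and the current run of the setup function.
local function plan_update (old, new)
   local plan = { stop = {}, start = {}, reconfigure = {} }
   for id in pairs(old) do
      if new[id] == nil then
         table.insert(plan.stop, id)          -- ID vanished: stop worker
      else
         table.insert(plan.reconfigure, id)   -- ID kept: reconfigure in place
      end
   end
   for id in pairs(new) do
      if old[id] == nil then
         table.insert(plan.start, id)         -- new ID: start worker
      end
   end
   return plan
end

-- Changing only the filter keeps the ID, so the worker is reconfigured.
local same = plan_update({["83:00.0/0"] = "graph A"},
                         {["83:00.0/0"] = "graph B"})
-- Changing the device changes the ID, so one worker stops and one starts.
local moved = plan_update({["83:00.0/0"] = "graph A"},
                          {["82:00.0/0"] = "graph A"})
```

Deriving the ID from the device and queue therefore gives exactly the behaviour argued for above: filter changes do not interrupt packet flow, device changes restart the worker.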
With all of these considerations, we are ready to actually write the setup function.
local app_graph = require('core.config')
local pci = require('lib.hardware.pci')
local pcap_filter = require('apps.packet_filter.pcap_filter')
-- Function taking a snabb-pf-v1 configuration and
-- returning a table mapping worker ID to app graph.
return function (conf)
   -- Load NIC driver for PCI address.
   local device_info = pci.device_info(conf.device)
   local driver = require(device_info.driver).driver
   -- Make a new app graph for this configuration.
   local graph = app_graph.new()
   app_graph.app(graph, "nic", driver,
                 {pciaddr=conf.device, rxq=conf.rss_queue,
                  txq=conf.rss_queue})
   app_graph.app(graph, "filter", pcap_filter.PcapFilter,
                 {filter=conf.filter})
   app_graph.link(graph, "nic."..device_info.tx.." -> filter.input")
   app_graph.link(graph, "filter.output -> nic."..device_info.rx)
   -- Use DEVICE/QUEUE as the worker ID.
   local id = conf.device..'/'..conf.rss_queue
   -- One worker with the given ID and the given app graph.
   return {[id]=graph}
end
Put this in, say, pf-v1.lua
, and we’re good to go. The network function can be run like this:
$ snabb ptree --name my-filter pf-v1.yang pf-v1.lua pf-v1.cfg
See snabb ptree --help
for full details on arguments like --name
.
The snabb ptree
program also takes a number of options that apply to the data-plane processes.
— --cpu cpus
Allocate cpus to the data-plane processes. The manager of the process tree will allocate CPUs from this set to data-plane workers. For example, --cpu 3-5,7-9
assigns CPUs 3, 4, 5, 7, 8, and 9 to the network function. The manager will try to allocate a CPU for a worker that is NUMA-local to the PCI devices used by the worker.
— --real-time
Use the SCHED_FIFO
real-time scheduler for the data-plane processes.
— --on-ingress-drop action
If a data-plane process detects too many dropped packets (by default, 100K packets over 30 seconds), perform action. Available actions are flush
, which tells Snabb to re-optimize the code; warn
, which simply prints a warning and raises an alarm; and off
, which does nothing.
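To make the CPU set notation used by --cpu concrete, here is a sketch of how a string such as 3-5,7-9 expands into individual CPU numbers. `parse_cpuset` is a hypothetical function written for this illustration; the option parsing inside Snabb may well differ.

```lua
-- Hypothetical parser: expand "3-5,7-9" into {3, 4, 5, 7, 8, 9}.
local function parse_cpuset (s)
   local cpus = {}
   for range in s:gmatch("[^,]+") do
      -- Each comma-separated item is either "LO-HI" or a single CPU.
      local lo, hi = range:match("^(%d+)-(%d+)$")
      if not lo then
         lo = range:match("^(%d+)$")
         hi = lo
      end
      assert(lo, "invalid CPU range: " .. range)
      for cpu = tonumber(lo), tonumber(hi) do
         table.insert(cpus, cpu)
      end
   end
   return cpus
end

print(table.concat(parse_cpuset("3-5,7-9"), " "))  -- prints "3 4 5 7 8 9"
```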
The manager of a ptree-based Snabb network function also listens to configuration queries and updates on a local socket. The user-facing side of this interface is snabb config
. A snabb config
user can address a local ptree network function by PID, but it’s easier to do so by name, so the above example passed --name my-filter
to the snabb ptree
invocation.
For example, we can get the configuration of a running network function with snabb config get
:
$ snabb config get my-filter /
device 83:00.0;
rss-queue 0;
filter "tcp port 80";
You can also update the configuration. For example, to move this network function over to device 82:00.0
, do:
$ snabb config set my-filter /device 82:00.0
$ snabb config get my-filter /
device 82:00.0;
rss-queue 0;
filter "tcp port 80";
The ptree manager takes the necessary actions to update the dataplane to match the specified configuration.
Let’s say your clients are really loving this network function, so much so that they are running an instance on each network card on your server. Whenever the filter string changes, though, they get tired of having to snabb config set
all of the different processes. Well you can make them even happier by refactoring the network function to be multi-process.
module snabb-pf-v2 {
namespace snabb:pf-v2;
prefix pf-v2;
/* Default filter string. */
leaf filter { type string; default ""; }
list worker {
key "device rss-queue";
leaf device { type string; }
leaf rss-queue { type uint8; }
/* Optional worker-specific filter string. */
leaf filter { type string; }
}
}
Here we declare a new YANG model that, instead of having one device and RSS queue, has a whole list of them. The key "device rss-queue"
declaration says that the combination of device and RSS queue should be unique – you can’t have two different workers on the same device+queue pair, logically. We declare a default filter
at the top level, and also allow each worker to override with their own filter declaration.
A configuration might look like this:
filter "tcp port 80";
worker {
device 83:00.0;
rss-queue 0;
}
worker {
device 83:00.0;
rss-queue 1;
}
worker {
device 83:00.1;
rss-queue 0;
filter "tcp port 443";
}
worker {
device 83:00.1;
rss-queue 1;
filter "tcp port 443";
}
Finally, we need a new setup function as well:
local app_graph = require('core.config')
local pci = require('lib.hardware.pci')
local pcap_filter = require('apps.packet_filter.pcap_filter')
-- Function taking a snabb-pf-v2 configuration and
-- returning a table mapping worker ID to app graph.
return function (conf)
   local workers = {}
   for k, v in pairs(conf.worker) do
      -- Load NIC driver for PCI address.
      local device_info = pci.device_info(k.device)
      local driver = require(device_info.driver).driver
      -- Make a new app graph for this worker.
      local graph = app_graph.new()
      app_graph.app(graph, "nic", driver,
                    {pciaddr=k.device, rxq=k.rss_queue,
                     txq=k.rss_queue})
      app_graph.app(graph, "filter", pcap_filter.PcapFilter,
                    {filter=v.filter or conf.filter})
      app_graph.link(graph, "nic."..device_info.tx.." -> filter.input")
      app_graph.link(graph, "filter.output -> nic."..device_info.rx)
      -- Use DEVICE/QUEUE as the worker ID.
      local id = k.device..'/'..k.rss_queue
      -- Add worker with the given ID and the given app graph.
      workers[id] = graph
   end
   return workers
end
If we place these into analogously named files, we have a multiprocess network function:
$ snabb ptree --name my-filter pf-v2.yang pf-v2.lua pf-v2.cfg
If you change the root filter string via snabb config
, it propagates to all workers, except those that have their own overrides of course:
$ snabb config set my-filter /filter "'tcp port 666'"
$ snabb config get my-filter /filter
"tcp port 666"
The syntax to get at a particular worker is a little gnarly; it’s based on XPath, for compatibility with existing NETCONF NCS systems. See the snabb config
documentation for full details.
$ snabb config get my-filter '/worker[device=83:00.1][rss-queue=1]'
filter "tcp port 443";
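If such paths have to be produced by a script, a small helper keeps the bracket syntax in one place. This is a sketch only: `list_entry_path` is a hypothetical function, it takes the keys as an ordered list so that the output order is predictable, and it does no escaping of special characters.

```lua
-- Hypothetical helper: build a path to one entry of a keyed list.
local function list_entry_path (list, keys)
   local parts = { "/" .. list }
   for _, kv in ipairs(keys) do
      -- kv is a {name, value} pair, e.g. {"device", "83:00.1"}.
      table.insert(parts, ("[%s=%s]"):format(kv[1], tostring(kv[2])))
   end
   return table.concat(parts)
end

local path = list_entry_path("worker",
                             {{"device", "83:00.1"}, {"rss-queue", 1}})
-- path is now "/worker[device=83:00.1][rss-queue=1]"
```

When passing the result to a shell command, quote it as in the example above so that the brackets are not interpreted by the shell.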
You can stop a worker with snabb config remove
:
$ snabb config remove my-filter '/worker[device=83:00.1][rss-queue=1]'
$ snabb config get my-filter /
filter "tcp port 666";
worker {
device 83:00.0;
rss-queue 0;
}
worker {
device 83:00.0;
rss-queue 1;
}
worker {
device 83:00.1;
rss-queue 0;
filter "tcp port 443";
}
Start up a new one with snabb config add
:
$ snabb config add my-filter /worker <<EOF
{
device 83:00.1;
rss-queue 1;
filter "tcp port 8000";
}
EOF
Voilà! Now your clients will think you are a wizard!
The lib.watchdog.watchdog
module implements a per-thread watchdog functionality. Its purpose is to watch and kill processes which fail to call the watchdog periodically (e.g. hang).
It does so by using alarm(3) and ualarm(3) to have the OS send a SIGALRM to the process after a specified timeout. Because the process does not handle the signal it will be killed and exit with status 142.
— Function watchdog.set milliseconds
Set watchdog timeout to milliseconds. Values for milliseconds greater than 1,000 are rounded up to the next full second. For example:
watchdog.set(1100) == watchdog.set(2000)
— Function watchdog.reset
Starts the timeout if the watchdog has not yet been started and resets the timeout otherwise. If the timeout is reached the process will be killed.
— Function watchdog.stop
Disables the timeout.
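The rounding rule for watchdog.set and the intended calling pattern can be sketched as follows. Both functions are hypothetical illustrations: `effective_timeout` restates the documented rule and is not the module's implementation, and `run_guarded` takes the watchdog module as an argument (assumed to be what require("lib.watchdog.watchdog") returns) so that the pattern is visible without a Snabb build.

```lua
-- Illustration of the documented rule: timeouts above one second are
-- rounded up to a whole number of seconds.
local function effective_timeout (milliseconds)
   if milliseconds > 1000 then
      return math.ceil(milliseconds / 1000) * 1000
   end
   return milliseconds
end

-- Hypothetical main loop: `step` performs one bounded unit of work.
local function run_guarded (watchdog, step, iterations)
   watchdog.set(2000)
   for _ = 1, iterations do
      -- Must be called again before the timeout expires, otherwise the
      -- process is killed by SIGALRM.
      watchdog.reset()
      step()
   end
   watchdog.stop()
end
```

With this rule, effective_timeout(1100) and effective_timeout(2000) both yield 2000, matching the example given for watchdog.set.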
Servers devoted to the Snabb project and usable by all known developers.
Want to be a known developer? Sure! Just edit the user account list with your user and send a pull request. No fuss.
Run Snabb processes with sudo lock ./snabb ...
. The lock
command will automatically wait if somebody else is running a Snabb process on the same machine, which helps us avoid conflicts for access to hardware resources.
Send luke@snabb.co
your email address(es) to get an invitation to the Lab Slack.
Name | Purpose | SSH | Intel CPU | NICs |
---|---|---|---|---|
lugano-1 | General use | lugano-1.snabb.co | E5 1650v3 | 2 x 10G (82599), 4 x 10G (X710), 2 x 40G (XL710) |
lugano-2 | General use | lugano-2.snabb.co | E5 1650v3 | 2 x 10G (82599), 4 x 10G (X710), 2 x 40G (XL710) |
lugano-3 | General use | lugano-3.snabb.co | E5 1650v3 | 2 x 10G (82599), 2 x 100G (ConnectX-4) |
lugano-4 | General use | lugano-4.snabb.co | E5 1650v3 | 2 x 10G (82599), 2 x 100G (ConnectX-4) |
davos | Continuous Integration tests & driver development | lab1.snabb.co port 2000 | 2x E5 2603 | Diverse 10G/40G: Intel, SolarFlare, Mellanox, Chelsio, Broadcom. Installed upon request. |
grindelwald | Snabb NFV testing | lab1.snabb.co port 2010 | 2x E5 2697v2 | 12 x 10G (Intel 82599) |
interlaken | Haswell/AVX2 testing | lab1.snabb.co port 2030 | 2x E5 2620v3 | 12 x 10G (Intel 82599) |
murren-* | Hydra fleet for tests without NICs | (none) | i7-6700 | (none) |
You are welcome to play, test, and develop on the lugano-1
.. lugano-4
servers. Once your account is added you can connect like this:
$ ssh user@lugano-1.snabb.co
and check the PCI devices and their addresses with lspci
.
Certain cards (82599 and ConnectX-4) are cabled to themselves. That is, dual-port cards have their ports connected to each other. Certain other cards (X710/XL710) are currently not cabled. If you have special cabling needs then please open an issue on the snabblab-nixos repository.
All servers run the latest stable version of the NixOS Linux distribution.
To quickly install a package:
$ nox <search string>
For other operations such as uninstalling a package, refer to man nix-env
.
If you have any questions or trouble, ask on the #lab channel or open an issue.
We are grateful to Silicom for their sponsorship in the form of discounted network cards for chur
and to Netgate for giving us jura
. Thanks gang!