Snabb Reference Manual

Mike Pall

Luke Gorrie

Andy Wingo

Max Rottenkolber

Diego Pino Garcia

Asumu Takikawa

Jessica Tallon

Diego Pino

Alexander Gall

Cosmin Apreutesei

Nikolay Nikolaev

Katerina Barone-Adesi

Hans Huebner

Javier Guerra

Marcel Wiget

Nicola Larosa

Pete Bristow

Adrian Perez de Castro

Adrián Pérez de Castro

Nicola ‘tekNico’ Larosa

Antonio Nikishaev

Domen Kožar

Peter Cawley

Alexander Altshuler

Rolf Sommerhalder

Timo Buhrmester

Pete Kazmier

Dibyendu Majumdar

Lesley De Cruz

Mikhail Nazarov

Kristian Larsson

Justin Cormack

Ben Agricola

Felix Geißler

R. Matthew Emerson

Michael G

Christian Graf

Carlos Alberto Lopez Perez

Jeff Loughridge

Kacper Wysocki

Adrian Perez

William Adams

Vladimir Fedin

Vincenzo Maffione

Tomas Korcak

Tim Upthegrove

Tim LaBerge

Stevan Markovic

Simon Leinen

Ryan Hartlage

Kristian Kielhofner

Jon Olsson

Jianbo Liu

Jay Fenton

James Cunningham

Hui Xiang

Gernot Nusshall

Fabian Bonk

Edward Hope-Morley

Darius Bacon

Anshul Makkar

Andy Chong

Alex Kordic

Alexandr Kostrikov

Alexander Spyridakis

Version 4b0c18b, Fri Nov 8 08:53:57 2019 +0100

Introduction
Snabb API
- App
- Config (core.config)
- Engine (core.app)
- Link (core.link)
- Packet (core.packet)
- Memory (core.memory)
- Shared Memory (core.shm)
  - Counter (core.counter)
  - Histogram (core.histogram)
- Lib (core.lib)
- Multiprocess operation (core.worker)
- Main
Basic Apps (apps.basic.basic_apps)
- Source
- Join
- Split
- Sink
- Tee
- Repeater
- Truncate
- Sample
Intel 82599 Ethernet Controller Apps
- Intel82599 (apps.intel.intel_app)
- LoadGen (apps.intel.loadgen)
  - Configuration
  - Performance
Intel i210 / i350 / 82599 Ethernet Controller apps (apps.intel_mp.intel_mp)
- Caveats
- Configuration
Solarflare Ethernet Controller Apps
- Solarflare (apps.solarflare.solarflare)
  - Configuration
RateLimiter App (apps.rate_limiter.rate_limiter)
- Configuration
- Performance
PcapFilter App (apps.packet_filter.pcap_filter)
- Configuration
- Special Counters
IPv4 Apps
- ARP (apps.ipv4.arp)
  - Configuration
- Reassembler (apps.ipv4.reassemble)
  - Configuration
- Fragmenter (apps.ipv4.fragment)
  - Configuration
- ICMP Echo responder (apps.ipv4.echo)
  - Configuration
IPv6 Apps
- Nd_light (apps.ipv6.nd_light)
  - Configuration
  - Special Counters
- SimpleKeyedTunnel (apps.keyed_ipv6_tunnel.tunnel)
  - Configuration
  - Special Counters
- Fragmenter (apps.ipv6.fragment)
  - Configuration
- ICMP Echo responder (apps.ipv6.echo)
  - Configuration
VhostUser App (apps.vhost.vhost_user)
- Configuration
VirtioNet App (apps.virtio_net.virtio_net)
- Configuration
Pcap Savefile Apps
- PcapReader and PcapWriter Apps (apps.pcap.pcap)
  - Configuration
- Tap (apps.pcap.tap)
  - Configuration
RawSocket App (apps.socket.raw)
- Configuration
UnixSocket App (apps.socket.unix)
- Configuration
Tap app (apps.tap.tap)
- Configuration
VLAN Apps
- Tagger (apps.vlan.vlan)
  - Configuration
- Untagger (apps.vlan.vlan)
  - Configuration
- VlanMux (apps.vlan.vlan)
  - Configuration
Bridge Apps
- Configuration
- Flooding bridge (apps.bridge.flooding)
  - Configuration
- Learning bridge (apps.bridge.learning)
  - Configuration
IPFIX and NetFlow apps
- IPFIX (apps.ipfix.ipfix)
  - Configuration
  - To-do list
IPsec Apps
- ESP Transport6 and Tunnel6 (apps.ipsec.esp)
  - Configuration
Test Apps
- Match (apps.test.match)
  - Configuration
- Synth (apps.test.synth)
  - Configuration
- Npackets (apps.test.npackets)
  - Configuration
SnabbWall Apps
- L7Spy (apps.wall.l7spy)
- Filter (apps.wall.filter)
- Scanner (apps.wall.scanner)
  - Subclassing
  - NdpiScanner (apps.wall.scanner.ndpi)
- Utilities
  - SouthAndNorth (apps.wall.util)
RSS app (apps.rss.rss)
- Flow-director
- Packet replication
- Weighted links
- Packet meta-data
- IPv6 extension header elimination
- VLAN pseudo-tagging
- Configuration
- Meta-data API
Inter-process links (apps.interlink.*)
- Configuration
Libraries
- IP checksum (lib.checksum)
- Ctable (lib.ctable)
- Poptrie (lib.poptrie)
- PMU (lib.pmu)
- Snabb program configuration with YANG (lib.yang)
- Hardware
  - PCI (lib.hardware.pci)
  - Register (lib.hardware.register)
- Protocols
- IPsec
  - Encapsulating Security Payload (lib.ipsec.esp)
- Snabb NFV
- LISPER
  - LISPER (program.lisper)
- Ptree
  - Ptree (program.ptree)
- Watchdog (lib.watchdog.watchdog)
Snabblab
- Guidelines
- Servers
- Get started
- Using the lab
- Questions
- Thanks

Note: This reference manual is a draft. The API defined in this document is not guaranteed to be stable or complete and future versions of Snabb will introduce backwards incompatible changes. With that being said, discrepancies between this document and the actual Snabb Switch implementation are considered to be bugs. Please report them in order to help improve this document.

Introduction

Snabb is an extensible, virtualized, Ethernet networking toolkit. With Snabb you can implement networking applications using the Lua language. Snabb includes all the tools you need to quickly realize your network designs, and it’s really fast too! Furthermore, Snabb is extensible and encourages you to grow the ecosystem to match your requirements.

Architecture

The Snabb Core forms a runtime environment (engine) which executes your design. A design is simply a Lua script used to drive the Snabb stack; you can think of it as your top-level “main” routine.

In order to add functionality to the Snabb stack you can load modules into the Snabb engine. These can be Lua modules as well as native code objects. We differentiate between two classes of modules, namely libraries and Apps. Libraries are simple collections of program utilities to be used in your designs, apps or other libraries, just as you might expect. Apps, on the other hand, are code objects that implement a specific interface, which is used by the Snabb engine to organize an App Network.

Network

Usually, a Snabb design will create a series of apps, interconnect these in a desired way using links and finally pass the resulting app network on to the Snabb engine. The engine’s job is to:

- Pump traffic through the app network
- Keep the app network running (e.g. restart failed apps)
- Report on the network status

Snabb API

The core modules defined below can be loaded using Lua’s require. For example:

local config = require("core.config")

local c = config.new()
...

App

An app is an isolated implementation of a specific networking function, for example a switch, a router, or a packet filter.

Apps receive packets on input ports, perform some processing, and transmit packets on output ports. Each app has zero or more input and output ports. For example, a packet filter may have one input and one output port, while a packet recorder may have only an input port. Every app must implement the interface below. Methods which may be left unimplemented are marked as “optional”.

— Method myapp:new arg

Required. Create an instance of the app with a given argument arg. Myapp:new must return an instance of the app. The handling of arg is up to the app but it is encouraged to use core.config’s parse_app_arg to parse arg.

— Field myapp.input

— Field myapp.output

Tables of named input and output links. These tables are initialized by the engine for use in processing and are read-only.

— Field myapp.appname

Name of the app. Read-only.

— Field myapp.shm

Can be set to a specification for core.shm.create_frame. When set, this field will be initialized to a frame of shared memory objects by the engine.

— Field myapp.config

Can be set to a specification for core.lib.parse. When set, the specification will be used to validate the app’s arg when it is configured using config.app.

— Method myapp:link

Optional. Called any time the app’s links may have been changed (including on start-up). Guaranteed to be called before pull and push are called with new links.

— Method myapp:pull

Optional. Pull packets into the network.

For example: Pull packets from a network adapter into the app network by transmitting them to output ports.

— Method myapp:push

Optional. Push packets through the system.

For example: Move packets from input ports to output ports or to a network adapter.

— Method myapp:reconfig arg

Optional. Reconfigure the app with a new arg. If this method is not implemented the app instance is discarded and a new instance is created.

— Method myapp:report

Optional. Print a report of the current app status.

— Method myapp:stop

Optional. Stop the app and release associated external resources.

— Field myapp.zone

Optional. Name of the LuaJIT profiling zone used for this app (descriptive string). The default is the module name.
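The interface above can be sketched in plain Lua. This is a minimal illustration, not code from Snabb: LengthFilter, its max_length argument and its dropped field are hypothetical, and the small link table is a stand-in for core.link so that the sketch runs outside the engine. In a real app the links come from the engine and discarded packets would be released with packet.free.

```lua
-- Stand-in for core.link (assumption): a link is a Lua array used as a FIFO.
local link = {}
function link.new () return { q = {} } end
function link.empty (l) return #l.q == 0 end
function link.receive (l) return table.remove(l.q, 1) end
function link.transmit (l, p) l.q[#l.q + 1] = p end

-- Hypothetical app: forwards packets of at most max_length bytes and
-- discards longer ones.
local LengthFilter = {}
LengthFilter.__index = LengthFilter

-- Required: new must return an instance of the app.
function LengthFilter:new (arg)
   local o = { max_length = (arg and arg.max_length) or 1500, dropped = 0 }
   return setmetatable(o, self)
end

-- Optional: move packets from the input port to the output port.
function LengthFilter:push ()
   local input, output = self.input.input, self.output.output
   while not link.empty(input) do
      local p = link.receive(input)
      if p.length <= self.max_length then
         link.transmit(output, p)
      else
         self.dropped = self.dropped + 1   -- a real app would free the packet
      end
   end
end

-- Play the role of the engine: create the app, attach links, push once.
local app = LengthFilter:new({ max_length = 100 })
app.input, app.output = { input = link.new() }, { output = link.new() }
link.transmit(app.input.input, { length = 60 })
link.transmit(app.input.input, { length = 900 })
app:push()
print(#app.output.output.q, app.dropped)
```

Only new is required; the sketch adds push because the app sits between an input and an output port. An app that injects traffic would implement pull instead.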

Config (core.config)

A config is a description of a packet-processing network. The network is a directed graph. Nodes in the graph are apps that each process packets in a specific way. Each app has a set of named input and output ports—often called rx and tx. Edges of the graph are unidirectional links that carry packets from an output port to an input port.

The config is a purely passive data structure. Creating and manipulating a config object does not immediately affect operation. The config has to be activated using engine.configure.

— Function config.new

Creates and returns a new empty configuration.

— Function config.app config, name, class, arg

Adds an app of class with arg to the config where it will be assigned to name.

Example:

config.app(c, "nic", Intel82599, {pciaddr = "0000:00:00.0"})

— Function config.link config, linkspec

Add a link defined by linkspec to the config config. Linkspec must be a string of the format

app_name1.output_port->app_name2.input_port

where app_name1 and app_name2 are names of apps in config and output_port and input_port are valid output and input ports of the referenced apps respectively.

Example:

config.link(c, "nic1.tx->nic2.rx")
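To make the linkspec format concrete, the following sketch splits a linkspec into its four components with a Lua pattern. The helper parse_linkspec is hypothetical and is not part of core.config; it assumes that app and port names consist of letters, digits and underscores, and that whitespace around the arrow is tolerated.

```lua
-- Hypothetical helper: split a linkspec into its four components.
-- Returns nil if the string does not match the documented format.
local function parse_linkspec (spec)
   local from_app, from_port, to_app, to_port =
      spec:match("^%s*([%w_]+)%.([%w_]+)%s*%->%s*([%w_]+)%.([%w_]+)%s*$")
   if not from_app then return nil end
   return { from_app = from_app, from_port = from_port,
            to_app = to_app, to_port = to_port }
end

local l = parse_linkspec("nic1.tx->nic2.rx")
print(l.from_app, l.from_port, l.to_app, l.to_port)
print(parse_linkspec("nic1->nic2"))   -- port names missing: no match
```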

Engine (core.app)

The engine executes a config by initializing apps, creating links, and driving the flow of execution. The engine also performs profiling and reporting functions. It can be reconfigured during runtime. Within Snabb Switch scripts the core.app module is bound to the global engine variable.

— Function engine.configure config

Configure the engine to use a new config config. You can safely call this method many times to incrementally update the running app network. The engine updates the app network as follows:

- Apps that did not exist in the old configuration are started.
- Apps that do not exist in the new configuration are stopped. (The app stop() method is called if defined.)
- Apps with unchanged configurations are preserved.
- Apps with changed configurations are updated by calling their reconfig() method. If the reconfig() method is not implemented then the old instance is stopped and a new one is started.
- Links with unchanged endpoints are preserved.
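The update rules for apps can be modelled as a small planning function. This is a simplified sketch and not the engine’s implementation: a configuration is reduced to a table mapping app names to an argument string, plan_update and has_reconfig are hypothetical names, and links are left out.

```lua
-- Decide what happens to each app when moving from config `old` to `new`.
-- has_reconfig[name] is true for apps that implement reconfig().
local function plan_update (old, new, has_reconfig)
   local plan = {}
   for name, arg in pairs(new) do
      if old[name] == nil then
         plan[name] = "start"
      elseif old[name] == arg then
         plan[name] = "keep"
      elseif has_reconfig[name] then
         plan[name] = "reconfig"
      else
         plan[name] = "restart"   -- stop the old instance, start a new one
      end
   end
   for name in pairs(old) do
      if new[name] == nil then plan[name] = "stop" end
   end
   return plan
end

local plan = plan_update(
   { nic = "0000:01:00.0", filter = "tcp", tap = "tap0" },
   { nic = "0000:01:00.0", filter = "udp", sink = "" },
   { filter = true })
for _, name in ipairs({ "nic", "filter", "sink", "tap" }) do
   print(name, plan[name])
end
```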

— Function engine.main options

Run the Snabb engine. Options is a table of key/value pairs. The following keys are recognized:

- duration: Duration in seconds to run the engine for (as a floating point number). If this is set you cannot supply done.
- done: A function to be called repeatedly by engine.main until it returns true. Once it returns true the engine will be stopped and engine.main will return. If this is set you cannot supply duration.
- report: A table which configures the report printed before engine.main() returns. The keys showlinks and showapps can be set to boolean values to force or suppress link and app reporting individually. By default engine.main() will report on links but not on apps.
- measure_latency: By default, the breathe() loop is instrumented to record the latency distribution of running the app graph. This information can be processed by the snabb top program. Passing measure_latency=false in the options will disable this instrumentation.
- no_report: A boolean value. If true no final report will be printed.

— Function engine.stop

Stop all apps in the engine by loading an empty configuration.

— Function engine.now

Returns monotonic time in seconds as a floating point number. Suitable for timers.

— Variable engine.busywait

If set to true then the engine polls continuously for new packets to process. This consumes 100% CPU and makes processing latency less vulnerable to kernel scheduling behavior which can cause pauses of more than one millisecond.

Default: false

— Variable engine.Hz

Frequency at which to poll for new input packets. The default value is false, which means the engine adjusts the polling interval dynamically, sleeping up to 100us between polls during low traffic. The value can be overridden with a constant integer saying how many times per second to poll.

This setting is not used when engine.busywait is true.

Link (core.link)

A link is a ring buffer used to store packets between apps. Links can be treated either like arrays—accessing their internal structure directly—or as streams of packets by using their API functions.

— Function link.empty link

Predicate used to test if a link is empty. Returns true if link is empty and false otherwise.

— Function link.full link

Predicate used to test if a link is full. Returns true if link is full and false otherwise.

— Function link.nreadable link

Returns the number of packets on link.

— Function link.nwriteable link

Returns the remaining number of packets that fit onto link.

— Function link.receive link

Returns the next available packet (and advances the read cursor) on link. If the link is empty an error is signaled.

— Function link.front link

Return the next available packet without advancing the read cursor on link. If the link is empty, nil is returned.

— Function link.transmit link, packet

Transmits packet onto link. If the link is full packet is dropped (and the drop counter increased).

— Function link.stats link

Returns a structure holding ring statistics for the link:

- txbytes, rxbytes: Counts of transferred bytes.
- txpackets, rxpackets: Counts of transferred packets.
- txdrop: Count of packets dropped due to ring overflow.
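The behaviour of the link functions can be illustrated with a toy ring buffer in plain Lua. This is a model for illustration only, not Snabb’s FFI-based implementation: new_link and its size argument are hypothetical, one slot is kept unused so that equal cursors mean "empty", and only the packet counters are modelled.

```lua
-- Toy ring buffer. With `size` slots it holds at most size - 1 packets.
local function new_link (size)
   return { size = size, read = 0, write = 0, slots = {},
            stats = { txpackets = 0, rxpackets = 0, txdrop = 0 } }
end

local function nreadable (l) return (l.write - l.read) % l.size end
local function nwriteable (l) return l.size - 1 - nreadable(l) end
local function empty (l) return l.read == l.write end
local function full (l) return (l.write + 1) % l.size == l.read end

-- A packet transmitted onto a full link is dropped and counted.
local function transmit (l, p)
   if full(l) then
      l.stats.txdrop = l.stats.txdrop + 1
   else
      l.slots[l.write] = p
      l.write = (l.write + 1) % l.size
      l.stats.txpackets = l.stats.txpackets + 1
   end
end

-- Receiving from an empty link signals an error.
local function receive (l)
   assert(not empty(l), "receive on empty link")
   local p = l.slots[l.read]
   l.read = (l.read + 1) % l.size
   l.stats.rxpackets = l.stats.rxpackets + 1
   return p
end

local l = new_link(4)   -- room for 3 packets
for _, p in ipairs({ "a", "b", "c", "d" }) do transmit(l, p) end
print(nreadable(l), l.stats.txdrop)
print(receive(l), nwriteable(l))
```

The fourth packet finds the link full and is counted in txdrop, which is why apps typically check link.full or link.nwriteable before producing more packets than the link can hold.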

Packet (core.packet)

A packet is an FFI object of type struct packet representing a network packet that is currently being processed. Its life cycle is managed explicitly: packets are allocated and freed using packet.allocate and packet.free. When a packet is received using link.receive its ownership is acquired by the calling app. The app must then either transfer ownership of the packet to another app by calling link.transmit on the packet or free the packet using packet.free. Apps may only use packets they own, i.e. packets that have not been transmitted or freed. The number of allocatable packets is limited by the size of the underlying “freelist”, i.e. a pool of unused packet objects from and to which packets are allocated and freed.

— Type struct packet

struct packet {
    uint16_t length;
    uint8_t  data[packet.max_payload];
};

— Constant packet.max_payload

The maximum payload length of a packet.

— Function packet.allocate

Returns a new empty packet. An error is raised if there are no packets left on the freelist. Initially the length of the allocated packet is 0, and its data is uninitialized garbage.

— Function packet.free packet

Frees packet and puts it back onto the freelist.

— Function packet.clone packet

Returns an exact copy of packet.

— Function packet.resize packet, length

Sets the payload length of packet, truncating or extending its payload. In the latter case the contents of the extended area at the end of the payload are filled with zeros.

— Function packet.append packet, pointer, length

Appends length bytes starting at pointer to the end of packet. An error is raised if there is not enough space in packet to accommodate length additional bytes.

— Function packet.prepend packet, pointer, length

Prepends length bytes starting at pointer to the front of packet, taking ownership of the packet and returning a new packet. An error is raised if there is not enough space in packet to accommodate length additional bytes.

— Function packet.shiftleft packet, length

Take ownership of packet, truncate it by length bytes from the front, and return a new packet. Length must be less than or equal to length of packet.

— Function packet.shiftright packet, length

Take ownership of packet, move the packet payload to the right by length bytes, growing packet by length. Returns a new packet. The sum of length and length of packet must be less than or equal to packet.max_payload.

— Function packet.from_pointer pointer, length

Allocate packet and fill it with length bytes from pointer.

— Function packet.from_string string

Allocate packet and fill it with the contents of string.

— Function packet.clone_to_memory pointer, packet

Creates an exact copy of packet at the memory pointed to by pointer. Pointer must point to a packet.packet_t.

Memory (core.memory)

Snabb allocates special DMA memory that can be accessed directly by network cards. The important characteristic of DMA memory is being located in contiguous physical memory at a stable address.

— Function memory.dma_alloc bytes, [alignment]

Returns a pointer to bytes of new DMA memory.

Optionally a specific alignment requirement can be provided (in bytes). The default alignment is 128.

— Function memory.virtual_to_physical pointer

Returns the physical address (uint64_t) of the DMA memory at pointer.

— Variable memory.huge_page_size

Size of a single huge page in bytes. Read-only.

Shared Memory (core.shm)

This module facilitates creation and management of named shared memory objects. Objects can be created using shm.create similar to ffi.new, except that separate calls to shm.open for the same name will each return a new mapping of the same shared memory. Different processes can share memory by mapping an object with the same name (and type). Each process can map any object any number of times.

Mappings are deleted on process termination or with an explicit shm.unmap. Names are unlinked from objects that are no longer needed using shm.unlink. Object memory is freed when the name is unlinked and all mappings have been deleted.

Names can be fully qualified or abbreviated to be within the current process. Here are examples of names and how they are resolved where <pid> is the PID of this process:

- Local: foo/bar ⇒ /var/run/snabb/<pid>/foo/bar
- Fully qualified: /1234/foo/bar ⇒ /var/run/snabb/1234/foo/bar

Behind the scenes the objects are backed by files on a RAM disk (/var/run/snabb/<pid>) and accessed with the equivalent of POSIX shared memory (shm_overview(7)). The files are automatically removed on shutdown unless the environment variable SNABB_SHM_KEEP is set. The location /var/run/snabb can be overridden by the environment variable SNABB_SHM_ROOT.
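The name resolution rules can be written down as a short function. This is a sketch under stated assumptions, not the code used by core.shm: resolve is a hypothetical helper, the PID is passed in explicitly, and the optional root argument plays the role of SNABB_SHM_ROOT.

```lua
-- Map an object name to the path of its backing file.
local function resolve (name, pid, root)
   root = root or "/var/run/snabb"
   if name:sub(1, 1) == "/" then
      return root .. name                      -- fully qualified name
   else
      return root .. "/" .. pid .. "/" .. name -- local to this process
   end
end

print(resolve("foo/bar", 1234))
print(resolve("/1234/foo/bar", 5678))
print(resolve("foo/bar", 1234, "/tmp/snabb"))
```

A fully qualified name already carries a PID, so the PID of the calling process is ignored for it; this is what lets one process address objects that belong to another.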

Shared memory objects are created world-readable for convenient access by diagnostic tools. You can lock this down by setting SNABB_SHM_ROOT to a path under a directory with appropriate permissions.

The practical limit on the number of objects that can be mapped will depend on the operating system limit for memory mappings. On Linux the default limit is 65,530 mappings:

$ sysctl vm.max_map_count
vm.max_map_count = 65530

— Function shm.create name, type

Creates and maps a shared object of type into memory via a hierarchical name. Returns a pointer to the mapped object.

— Function shm.open name, type, [readonly]

Maps an existing shared object of type into memory via a hierarchical name. If readonly is non-nil the shared object is mapped in read-only mode. Readonly defaults to nil. Fails if the shared object does not already exist. Returns a pointer to the mapped object.

— Function shm.alias new-path, existing-path

Create an alias (symbolic link) for an object.

— Function shm.path name

Returns the fully-qualified path for an object called name.

— Function shm.exists name

Returns a true value if shared object by name exists.

— Function shm.unmap pointer

Deletes the memory mapping for pointer.

— Function shm.unlink path

Unlinks the subtree of objects designated by path from the filesystem.

— Function shm.children path

Returns an array of objects in the directory designated by path.

— Function shm.register type, module

Registers an abstract shared memory object type implemented by module in shm.types. Module must provide the following functions:

- create name, …
- open name

and can optionally provide the function:

- delete name

The module’s type variable must be bound to type. To register a new type a module might invoke shm.register like so:

type = shm.register('mytype', getfenv())
-- Now the following holds true:
--   shm.types[type] == getfenv()

— Variable shm.types

A table that maps types to modules. See shm.register.

— Function shm.create_frame path, specification

Creates and returns a shared memory frame by specification under path. A frame is a table of mapped (possibly abstract) shared memory objects. Specification must be of the form:

{ <name> = {<module>, ...},
  ... }

Module must implement an abstract type registered with shm.register, and is followed by additional initialization arguments to its create function. Example usage:

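local ffi = require("ffi")
local C = ffi.C
local shm = require("core.shm")
local counter = require("core.counter")
-- Create counters foo/bar/{dtime,rxpackets,txpackets}.counter
local f = shm.create_frame(
   "foo/bar",
   {dtime     = {counter, C.get_unix_time()},
    rxpackets = {counter},
    txpackets = {counter}})
-- Increment rxpackets by one and read back the dtime counter
counter.add(f.rxpackets)
counter.read(f.dtime)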

— Function shm.open_frame path

Opens and returns the shared memory frame under path for reading.

— Function shm.delete_frame frame

Deletes/unmaps a shared memory frame. The frame directory is unlinked if frame was created by shm.create_frame.

Counter (core.counter)

Double-buffered shared memory counters. Counters are 64-bit unsigned values. Registered with core.shm as type counter.

— Function counter.create name, [initval]

Creates and returns a counter by name, initialized to initval. Initval defaults to 0.

— Function counter.open name

Opens and returns the counter by name for reading.

— Function counter.delete name

Deletes and unmaps the counter by name.

— Function counter.commit

Commits buffered counter values to public shared memory.

— Function counter.set counter, value

Sets counter to value.

— Function counter.add counter, [value]

Increments counter by value. Value defaults to 1.

— Function counter.read counter

Returns the value of counter.

Histogram (core.histogram)

Shared memory histogram with logarithmic buckets. Registered with core.shm as type histogram.

— Function histogram.new min, max

Returns a new histogram, with buckets covering the range from min to max. The range between min and max will be divided logarithmically.
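One way to picture a logarithmic division of the range is the following sketch. It is an illustration only and makes no claim about how core.histogram lays out its buckets: bucket_index and bucket_bounds are hypothetical, the bucket count is a parameter, and the extra buckets for values below min (index 0) and at or above max (index nbuckets + 1) are assumptions of this model.

```lua
-- Index of the bucket that a measurement x falls into.
local function bucket_index (min, max, nbuckets, x)
   if x < min then return 0 end
   if x >= max then return nbuckets + 1 end
   -- position is the relative position of x on a logarithmic scale
   local position = math.log(x / min) / math.log(max / min)
   return math.min(nbuckets, 1 + math.floor(position * nbuckets))
end

-- Lower and upper bound of bucket i (1 <= i <= nbuckets).
local function bucket_bounds (min, max, nbuckets, i)
   local ratio = max / min
   return min * ratio ^ ((i - 1) / nbuckets), min * ratio ^ (i / nbuckets)
end

for _, x in ipairs({ 0.5, 5, 50, 500, 2000 }) do
   print(x, bucket_index(1, 1000, 3, x))
end
print(bucket_bounds(1, 1000, 3, 2))
```

With three buckets between 1 and 1000 every bucket spans one decade, so each bucket is ten times wider than the one before it. That keeps the relative resolution constant across the whole range, which suits latency measurements spanning several orders of magnitude.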

— Function histogram.create name, min, max

Creates and returns a histogram as in histogram.new by name. If the file exists already, it will be cleared.

— Function histogram.open name

Opens and returns histogram by name for reading.

— Method histogram:add measurement

Adds measurement to histogram.

— Method histogram:iterate prev

When used as for count, lo, hi in histogram:iterate(), visits all buckets in histogram in order from lowest to highest. Count is the number of samples recorded in that bucket, and lo and hi are the lower and upper bounds of the bucket. Note that count is an unsigned 64-bit integer; to get it as a Lua number, use tonumber.

If prev is given, it should be a snapshot of the previous version of the histogram. In that case, the count values will be returned as a difference between their values in histogram and their values in prev.

— Method histogram:snapshot [dest]

Copies out the contents of histogram into the histogram dest and returns dest. If dest is not given, the result will be a fresh histogram.

— Method histogram:clear

Clears the buckets of histogram.

— Method histogram:wrap_thunk thunk, now

Returns a closure that wraps thunk, measuring and recording the difference between calls to now before and after thunk into histogram.

— Method histogram:summarize prev

Returns the approximate minimum, average, and maximum values recorded in histogram.

If prev is given, it should be a snapshot of a previous version of the histogram. In that case, this method returns the approximate minimum, average and maximum values for the difference between histogram and prev.

Lib (core.lib)

The core.lib module contains miscellaneous utilities.

— Function lib.equal x, y

Predicate to test if x and y are structurally similar (isomorphic).

— Function lib.can_open filename, mode

Predicate to test if file at filename can be successfully opened with mode.

— Function lib.can_read filename

Predicate to test if file at filename can be successfully opened for reading.

— Function lib.can_write filename

Predicate to test if file at filename can be successfully opened for writing.

— Function lib.readcmd command, what

Runs Unix shell command and returns what of its output. What must be a valid argument to file:read.

— Function lib.readfile filename, what

Reads and returns what from file at filename. What must be a valid argument to file:read.

— Function lib.writefile filename, value

Writes value to file at filename using file:write. Returns the value returned by file:write.

— Function lib.readlink filename

Returns the true name of symbolic link at filename.

— Function lib.dirname filename

Returns the dirname(3) of filename.

— Function lib.basename filename

Returns the basename(3) of filename.

— Function lib.firstfile directory

Returns the filename of the first file in directory.

— Function lib.firstline filename

Returns the first line of file at filename as a string.

— Function lib.load_string string

Evaluates and returns the value of the Lua expression in string.

— Function lib.load_conf filename

Evaluates and returns the value of the Lua expression in file at filename.

— Function lib.store_conf filename, value

Writes value to file at filename as a Lua expression. Supports tables, strings and everything that can be readably printed using print.

— Function lib.bits bitset, basevalue

Returns a bitmask using the values of bitset as indexes. The keys of bitset are ignored (and can be used as comments).

Example:

bits({RESET=0,ENABLE=4}, 123) => 1<<0 | 1<<4 | 123

— Function lib.bitset value, n

Predicate to test if bit number n of value is set.

— Function lib.bitfield size, struct, member, offset, nbits, value

Combined accessor and setter function for bit ranges of integers in cdata structs. Sets nbits (number of bits) starting from offset to value. If value is not given the current value is returned.

Size may be one of 8, 16 or 32 depending on the bit size of the integer being set or read.

Struct must be a pointer to a cdata object and member must be the literal name of a member of struct.

Example:

local struct_t = ffi.typeof[[struct { uint16_t flags; }]]
-- Assuming `s' is an instance of `struct_t', set bits 4-7 to 0xF:
lib.bitfield(16, s, 'flags', 4, 4, 0xf)
-- Get the value:
lib.bitfield(16, s, 'flags', 4, 4) -- => 0xF

— Function string:split pattern

Returns an iterator over the string split by pattern. Pattern must be a valid argument to string:gmatch.

Example:

for word, sep in ("foo!bar!baz"):split("(!)") do
    print(word, sep)
end

> foo   !
> bar   !
> baz   nil

— Function lib.hexdump string

Returns hexadecimal string for bytes in string.

— Function lib.hexundump hexstring, n, error

Returns string of n bytes for hexstring. Throws an error if fewer than n hex-encoded bytes could be parsed unless error is false.

Error is optional and can be the error message to throw.

— Function lib.comma_value n

Returns a string for decimal number n with magnitudes separated by commas. Example:

comma_value(1000000) => "1,000,000"

— Function lib.random_bytes_from_dev_urandom length

Return length bytes of random data, as a byte array, taken from /dev/urandom. Suitable for cryptographic usage.

— Function lib.random_bytes_from_math_random length

Return length bytes of random data, as a byte array, where each byte was taken from math.random(0, 255). Not suitable for cryptographic usage.

— Function lib.random_bytes length

— Function lib.randomseed seed

Initialize Snabb’s random number generation facility. If seed is nil, then the Lua math.random() function will be seeded from /dev/urandom, and lib.random_bytes will be initialized to lib.random_bytes_from_dev_urandom. This is Snabb’s default mode of operation.

Sometimes it’s useful to make Snabb use deterministic random numbers. In that case, pass a seed to lib.randomseed; Snabb will set lib.random_bytes to lib.random_bytes_from_math_random, and also print out a message to stderr indicating that we are using lower-quality deterministic random numbers.

As part of its initialization process, Snabb will call lib.randomseed with the value of the SNABB_RANDOM_SEED environment variable (if any). Set this environment variable to enable deterministic random numbers.

— Function lib.bounds_checked type, base, offset, size

Returns a table that acts as a bounds checked wrapper around a C array of type and size starting at base plus offset. Type must be a ctype and the caller must ensure that the allocated memory region at base/offset is at least sizeof(type)*size bytes long.

— Function lib.throttle seconds

Return a closure that returns true at most once during any seconds (a floating point value) time interval, otherwise false.

— Function lib.timeout seconds

Returns a closure that returns true if seconds (a floating point value) have elapsed since it was created, otherwise false.
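The two closures can be modelled in a few lines of plain Lua. This sketch is not the core.lib source: it takes the clock as an explicit now argument (an assumption made so the example is deterministic outside the engine), whereas the library functions take only seconds.

```lua
-- Returns a closure that is true at most once per `seconds` interval.
local function throttle (seconds, now)
   local deadline = now()
   return function ()
      if now() >= deadline then
         deadline = now() + seconds
         return true
      end
      return false
   end
end

-- Returns a closure that is true once `seconds` have elapsed.
local function timeout (seconds, now)
   local deadline = now() + seconds
   return function () return now() >= deadline end
end

-- Drive both closures with a fake clock advancing in 0.25 s steps.
local t = 0
local function clock () return t end
local every_second = throttle(1, clock)
local expired = timeout(2.5, clock)
local fired = 0
for step = 0, 11 do
   t = step * 0.25
   if every_second() then fired = fired + 1 end
end
print(fired, expired())
```

Over the simulated 2.75 seconds the throttled closure fires at 0, 1 and 2 seconds, and the timeout has expired by the end. A typical use is rate-limiting log output or periodic housekeeping inside a push or pull method.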

— Function lib.waitfor condition

Blocks until the function condition returns a true value.

— Function lib.waitfor2 name, condition, attempts, interval

Calls the function condition repeatedly, waiting interval (milliseconds) between calls. Returns as soon as condition returns a true value. If condition has not returned a true value after attempts calls, waitfor2 raises an error identified by name.

— Function lib.yesno flag

Returns the string "yes" if flag is a true value and "no" otherwise.

— Function lib.align value, size

Return the next integer that is a multiple of size starting from value.
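The rounding rule amounts to the following arithmetic, shown here as a standalone sketch (a local align function with the same arguments, written for illustration rather than copied from core.lib):

```lua
-- Round value up to the next multiple of size; exact multiples are unchanged.
local function align (value, size)
   local rest = value % size
   if rest == 0 then return value end
   return value + size - rest
end

print(align(0, 64), align(1, 64), align(64, 64), align(65, 64))
```

For a size of 64 the inputs 0, 1, 64 and 65 map to 0, 64, 64 and 128.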

— Function lib.csum pointer, length

Computes and returns the “IP checksum” of length bytes starting at pointer.

— Function lib.update_csum pointer, length, checksum

Returns checksum updated by length bytes starting at pointer. The default of checksum is 0LL.

— Function lib.finish_csum checksum

Returns the finalized checksum.
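The split into an update step and a finishing step can be illustrated with the Internet checksum algorithm of RFC 1071 in plain Lua. This is a sketch for illustration, not the core.lib implementation: the functions below operate on Lua strings instead of a pointer and length, and the assumption that lib.csum follows RFC 1071 exactly is mine.

```lua
-- Add the 16-bit big-endian words of data to the running sum.
local function update_csum (data, sum)
   sum = sum or 0
   for i = 1, #data, 2 do
      local hi, lo = data:byte(i), data:byte(i + 1) or 0  -- pad odd length
      sum = sum + hi * 256 + lo
   end
   return sum
end

-- Fold the carries back into 16 bits and take the one's complement.
local function finish_csum (sum)
   while sum > 0xffff do
      sum = (sum % 0x10000) + math.floor(sum / 0x10000)
   end
   return 0xffff - sum
end

local function csum (data) return finish_csum(update_csum(data)) end

-- Example bytes from RFC 1071: 00 01 f2 03 f4 f5 f6 f7
local data = string.char(0x00, 0x01, 0xf2, 0x03, 0xf4, 0xf5, 0xf6, 0xf7)
print(string.format("%04x", csum(data)))   -- 220d

-- The same result, computed incrementally over two even-length pieces.
local partial = update_csum(data:sub(1, 4))
print(string.format("%04x", finish_csum(update_csum(data:sub(5), partial))))
```

Keeping the running sum unfolded until the end is what makes incremental updates cheap: a header and a payload can be summed separately and combined before the final fold.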

— Function lib.malloc etype

Returns a pointer to newly allocated DMA memory for etype.

— Function lib.deepcopy object

Returns a copy of object. Supports tables as well as ctypes.

— Function lib.array_copy array

Returns a copy of array. Array must not be a “sparse array”.

— Function lib.htonl n

— Function lib.htons n

Host to network byte order conversion functions for 32 and 16 bit integers n respectively. Unsigned.

— Function lib.ntohl n

— Function lib.ntohs n

Network to host byte order conversion functions for 32 and 16 bit integers n respectively. Unsigned.

— Function lib.random_bytes count

Return a fresh array of count random bytes. Suitable for cryptographic usage.

— Function lib.parse arg, config

Validates arg against the specification in config, and returns a fresh table containing the parameters in arg and any omitted optional parameters with their default values. Given arg, a table of parameters or nil, assert that from config all of the required keys are present, fill in any missing values for optional keys, and error if any unknown keys are found. Config has the following format:

config := { key = {[required=boolean], [default=value]}, ... }

Each key is optional unless required is set to a true value, and its default value defaults to nil.

Example:

lib.parse({foo=42, bar=43}, {foo={required=true}, bar={}, baz={default=44}})
  => {foo=42, bar=43, baz=44}
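The validation rules can be spelled out as a short plain-Lua model. This is a sketch of the documented behaviour, not the core.lib source; the error messages are made up for the illustration.

```lua
-- Validate arg against config and fill in defaults for optional keys.
local function parse (arg, config)
   local ret = {}
   arg = arg or {}
   for k in pairs(arg) do
      assert(config[k], "unrecognized parameter: " .. k)
   end
   for k, spec in pairs(config) do
      if arg[k] ~= nil then
         ret[k] = arg[k]
      elseif spec.required then
         error("missing required parameter: " .. k)
      else
         ret[k] = spec.default
      end
   end
   return ret
end

local spec = { foo = { required = true }, bar = {}, baz = { default = 44 } }
local conf = parse({ foo = 42, bar = 43 }, spec)
print(conf.foo, conf.bar, conf.baz)

-- Both a missing required key and an unknown key raise an error.
print(pcall(parse, { bar = 43 }, spec))
print(pcall(parse, { foo = 42, qux = 1 }, spec))
```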

— Function lib.set vargs

Reads a variable number of arguments and returns a table representing a set. The returned value can be used to query whether an element belongs to the set.

Example:

local t = lib.set('foo', 'bar')
t['foo']  -- yields true.
t['quax'] -- yields nil, which is false in a boolean context.

Multiprocess operation (core.worker)

Snabb can operate as a group of cooperating processes. The main process is the initial one that you start directly. The optional worker processes are children spawned when the main process calls the core.worker module.

Each worker is a complete Snabb process. They can define app networks, run the engine, and do everything else that ordinary Snabb processes do. The exact behavior of each worker is determined by a Lua expression provided upon creation.

Groups of Snabb processes each have the following special properties:

Group termination: Terminating the main process automatically terminates all of the workers. This works for all process termination scenarios including kill -9.
Shared DMA memory: DMA memory pointers obtained with memory.dma_alloc() are usable by all processes in the group. This means that you can share DMA memory pointers between processes, for example via shm shared memory objects, and reference them from any process. (The memory is automatically mapped at the expected address via a SEGV signal handler.)
PCI device shutdown: For each PCI device opened by a process within the group, bus mastering (DMA) is disabled upon termination before any DMA memory is returned to the kernel. This prevents “dangling” DMA requests from corrupting memory that has been freed and reused. See lib.hardware.pci for details.

The core.worker API functions are available in the main process only:

— Function worker.start name luacode

Start a named worker process. The worker starts with a completely fresh Snabb process image (fork()+execve()) and then executes the string luacode as a Lua source code expression.

Example:

worker.start("myworker", [[
   print("hello world, from a Snabb worker process!")
   print("could configure and run the engine now...")
]])

— Function worker.stop name

Stop a named worker process. The worker is abruptly killed.

Example:

worker.stop("myworker")

— Function worker.status

Return a table summarizing the status of all workers. The table key is the worker name and the value is a table with pid and alive attributes.

Example:

for w, s in pairs(worker.status()) do
   print(("  worker %s: pid=%s alive=%s"):format(
         w, s.pid, s.alive))
end

Output:

worker w3: pid=21949 alive=true
worker w1: pid=21947 alive=true
worker w2: pid=21948 alive=true

Main

Snabb designs can be run either from the command line:

snabb <snabb-arg>* <design> <design-arg>*

or as a script that starts with a shebang line:

#!/usr/bin/env snabb <snabb-arg>*
...

The main module provides an interface for running Snabb scripts. It exposes various operating system functions to scripts.

— Field main.parameters

A list of command-line arguments to the running script. Read-only.

— Function main.exit status

Cleanly exits the process with status.

Basic Apps (apps.basic.basic_apps)

The module apps.basic.basic_apps provides apps with general functionality for use in your app networks.

Source

The Source app is a synthetic packet generator. On each breath it fills each attached output link with new packets. It accepts a number as its configuration argument which is the byte size of the generated packets. By default, each packet is 60 bytes long. The packet data is initialized with zero bytes.

Join

The Join app joins together packets from N input links onto one output link. On each breath it outputs as many packets as possible from the inputs onto the output.

Split

The Split app splits packets from multiple inputs across multiple outputs. On each breath it transfers as many packets as possible from the input links to the output links.

Sink

The Sink app receives all packets from any number of input links and discards them. This can be handy in combination with a Source.

Tee

The Tee app receives all packets from any number of input links and transfers each received packet to all output links. It can be used to merge and/or duplicate packet streams.

Repeater

The Repeater app collects all packets received from the input link and repeatedly transfers the accumulated packets to the output link. The packets are transmitted in the order they were received.

Truncate

The Truncate app sends all packets received from the input to the output link and truncates or zero pads each packet to a given length. It accepts a number as its configuration argument which is the length of the truncated or padded packets.

Sample

The Sample app forwards every nth packet from the input link to the output link and drops all other packets. It accepts a number as its configuration argument which is n.
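
The basic apps are usually combined into small test networks. The following is a minimal sketch of such a network, in which a Source feeds a Tee that duplicates every packet to a Sink. The app names and link names are arbitrary choices, and the script has to be run inside a Snabb process (for example with snabb snsh) rather than with a stock Lua interpreter.

```lua
-- Sketch only: the app and link names are arbitrary choices.
local config = require("core.config")
local engine = require("core.app")
local basic_apps = require("apps.basic.basic_apps")

local c = config.new()
config.app(c, "source", basic_apps.Source, 120)  -- 120-byte packets
config.app(c, "tee", basic_apps.Tee)
config.app(c, "sink", basic_apps.Sink)
config.link(c, "source.output -> tee.input")
config.link(c, "tee.copy1 -> sink.input1")
config.link(c, "tee.copy2 -> sink.input2")

engine.configure(c)
-- Run for one second and print the link statistics afterwards.
engine.main({duration = 1, report = {showlinks = true}})
```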

Intel 82599 Ethernet Controller Apps

Intel82599 (apps.intel.intel_app)

The Intel82599 app drives one port of an Intel 82599 Ethernet controller. Packets taken from the rx port are transmitted onto the network. Packets received from the network are put on the tx port.

— Method Intel82599.dev:get_rxstats

Returns a table with the following keys:

counter_id - Counter id
packets - Number of packets received
dropped - Number of packets dropped
bytes - Total bytes received

— Method Intel82599.dev:get_txstats

Returns a table with the following keys:

counter_id - Counter id
packets - Number of packets sent
bytes - Total bytes sent

Configuration

The Intel82599 app accepts a table as its configuration argument. The following keys are defined:

— Key pciaddr

Required. The PCI address of the NIC as a string.

— Key macaddr

Optional. The MAC address to use as a string. The default is a wild-card (i.e. accept all packets).

— Key vlan

Optional. A twelve bit integer (0-4095). If set, incoming packets from other VLANs are dropped and outgoing packets are tagged with a VLAN header.

— Key vmdq

Optional. Boolean, defaults to false. Enables interface virtualization, which allows multiple Intel82599 apps per port. If enabled, macaddr must be specified.

— Key mirror

Optional. A table. If set, this app will receive copies of all selected packets on the physical port. The selection is configured by setting keys of the mirror table. Either mirror.pool or mirror.port may be set.

If mirror.pool is true all pools defined on this physical port are mirrored. If mirror.pool is an array of pool numbers then the specified pools are mirrored.

If mirror.port is one of “in”, “out” or “inout” all incoming and/or outgoing packets on the port are mirrored respectively. Note that this does not include internal traffic which does not enter or exit through the physical port.

— Key rxcounter

— Key txcounter

Optional. Four bit integers (0-15). If set, incoming/outgoing packets will be counted in the selected statistics counter respectively. Multiple apps can share a counter. To retrieve counter statistics use Intel82599.dev:get_rxstats() and Intel82599.dev:get_txstats().

— Key rate_limit

Optional. Number. Limits the transmit rate, in Mbit/s. Default is 0, which means no limit. Only applies to outgoing traffic.

— Key priority

Optional. Floating point number. Weight for the round-robin algorithm used to arbitrate transmission when rate_limit is not set or adds up to more than the line rate of the physical port. Default is 1.0 (scaled to the geometric middle of the scale which goes from 1/128 to 128). The absolute value is not relevant, instead only the ratio between competing apps controls their respective bandwidths. Only applies to outgoing traffic.

For example, if two apps without rate_limit set have the same priority, both get the same output bandwidth. If the priorities are 3.0/1.0, the output bandwidth is split 75%/25%. Likewise, 1.0/0.333 or 1.5/0.5 yield the same result.

Note that even a low-priority app can use the whole line rate unless other (higher priority) apps are using up the available bandwidth.
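
The arithmetic behind these examples can be sketched as follows. The function name bandwidth_shares is hypothetical, and the sketch assumes that every competing app always has traffic to send and that no rate_limit is set, so that each share is simply the app's priority divided by the sum of all priorities.

```lua
-- Sketch only: expected share of the line rate for competing apps.
local function bandwidth_shares (priorities)
   local total = 0
   for _, p in ipairs(priorities) do total = total + p end
   local shares = {}
   for i, p in ipairs(priorities) do shares[i] = p / total end
   return shares
end

local shares = bandwidth_shares({3.0, 1.0})
print(shares[1], shares[2])  -- 0.75    0.25
```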

Performance

The Intel82599 app can transmit and receive at approximately 10 Mpps per processor core.

Hardware limits

Each physical Intel 82599 port supports the use of up to:

64 pools (virtualized Intel82599 app instances)
127 MAC addresses (see the macaddr configuration option)
64 VLANs (see the vlan configuration option)
4 mirror pools (see the mirror configuration option)
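
A port shared by two virtualized apps, well within the limits listed above, might be configured roughly as follows. This is a sketch only: the PCI address, MAC addresses and app names are placeholders, and the fragment has to run inside a Snabb process on a machine with a suitable NIC.

```lua
-- Sketch only: addresses and names are placeholders.
local config = require("core.config")
local Intel82599 = require("apps.intel.intel_app").Intel82599

local c = config.new()
-- Two virtualized apps (pools) that share one physical port.
config.app(c, "nic_a", Intel82599, {
   pciaddr = "0000:01:00.0",
   vmdq = true,
   macaddr = "52:54:00:00:00:01",  -- required because vmdq is set
   vlan = 10,
   priority = 3.0                  -- about 75% under contention
})
config.app(c, "nic_b", Intel82599, {
   pciaddr = "0000:01:00.0",
   vmdq = true,
   macaddr = "52:54:00:00:00:02",
   vlan = 10,
   priority = 1.0                  -- about 25% under contention
})
```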

LoadGen (apps.intel.loadgen)

LoadGen is a load generator app based on the Intel 82599 Ethernet controller. It reads up to 32,000 packets from the input port and transmits them repeatedly onto the network. All incoming packets are dropped.

Configuration

The LoadGen app accepts a string as its configuration argument. The given string denotes the PCI address of the NIC to use.

Performance

The LoadGen app can transmit at line-rate (14 Mpps) without significant CPU usage.

Intel i210 / i350 / 82599 Ethernet Controller apps (apps.intel_mp.intel_mp)

The intel_mp.Intel app provides drivers for Intel i210/i350/82599 based network cards. The driver exposes multiple receive and transmit queues that can be attached to separate instances of the app on different processes.

The links are named input and output.

Caveats

If attaching multiple processes to a single NIC, performance appears better with engine.busywait = false.

The intel_mp.Intel app can drive an Intel 82599 NIC at 14 million pps.

— Method Intel:get_rxstats

Returns a table with the following keys:

counter_id - Counter id
packets - Number of packets received
dropped - Number of packets dropped
bytes - Total bytes received

— Method Intel:get_txstats

Returns a table with the following keys:

counter_id - Counter id
packets - Number of packets sent
bytes - Total bytes sent

Configuration

— Key pciaddr

Required. The PCI address of the NIC as a string.

— Key ndesc

Optional. Number of DMA descriptors to use, i.e. the size of the DMA transmit and receive queues. Must be a multiple of 128. The default is not specified here but is assumed to be broadly applicable.

— Key rxq

Optional. The receive queue to attach to, numbered from 0. The default is 0. When VMDq is enabled, this number is used to index a queue (0 or 1) for the selected pool. Passing false will disable the receive queue.

— Key txq

Optional. The transmit queue to attach to, numbered from 0. The default is 0. Passing false will disable the transmit queue.

— Key vmdq

Optional. A boolean parameter that specifies whether VMDq (Virtual Machine Device Queues) is enabled. When VMDq is enabled, each instance of the driver is associated with a pool that can be assigned a MAC address or VLAN tag. Packets are delivered to pools that match the corresponding MACs or VLAN tags. Each pool may be associated with several receive and transmit queues.

For a given NIC, all driver instances should have this parameter either enabled or disabled uniformly. If this is enabled, macaddr must be specified.

— Key vmdq_queueing_mode

Optional. Sets the queueing mode to use in VMDq mode. Has no effect when VMDq is disabled. The available queueing modes for the 82599 are "rss-64-2" (the default with 64 pools, 2 queues each) and "rss-32-4" (32 pools, 4 queues each). The i350 provides only a single mode (8 pools, 2 queues each) and hence ignores this option.

— Key poolnum

Optional. The VMDq pool to associate with, numbered from 0. The default is to select a pool number automatically. The maximum pool number depends on the queueing mode.

— Key macaddr

Optional. The MAC address to use as a string. The default is a wild-card (i.e., accept all packets).

— Key vlan

Optional. A twelve-bit integer (0-4095). If set, incoming packets from other VLANs are dropped and outgoing packets are tagged with a VLAN header.

— Key mirror

If mirror.pool is true all pools defined on this physical port are mirrored. If mirror.pool is an array of pool numbers then the specified pools are mirrored.

— Key rxcounter

— Key txcounter

— Key rate_limit

Optional. Number. Limits the transmit rate, in Mbit/s. Default is 0, which means no limit. Only applies to outgoing traffic.

— Key priority

Note that even a low-priority app can use the whole line rate unless other (higher priority) apps are using up the available bandwidth.

— Key rsskey

Optional. The rsskey is a 32 bit integer that seeds the hash used to distribute packets across queues. If the packet flow passes through multiple levels of RSS-enabled Snabb devices, giving each level a unique key will help packet distribution.

— Key wait_for_link

Optional. Boolean that indicates if new should block until there is a link light or not. The default is false.

— Key linkup_wait

Optional. Number of seconds new waits for the device to come up. The default is 120.

— Key linkup_wait_recheck

Optional. If the linkup_wait option is true, the number of seconds to sleep between checks of the link state. The default is 0.1 seconds.

— Key mtu

Optional. The maximum packet length sent or received, excluding the trailing 4 byte CRC. The default is 9014.

— Key master_stats

Optional. Boolean indicating whether to elect an arbitrary app (the master) to collect device statistics. The default is true.

— Key run_stats

Optional. Boolean indicating if this app instance should collect device statistics. One per physical NIC (conflicts with master_stats). There is a small but detectable run time performance hit incurred. The default is false.

— Key mac_loopback

Optional. Boolean indicating if the card should operate in “Tx->Rx MAC Loopback mode” for diagnostics or testing purposes. If this is true then wait_for_link is implicitly false. The default is false.

RSS hashing methods

RSS will distribute packets based on as many of the fields below as are present in the packet:

Source / Destination IP address
Source / Destination TCP ports
Source / Destination UDP ports

Default RSS Queue

Packets that are not IPv4 or IPv6 will be delivered to receive queue 0.

Hardware limits

Each chipset supports a differing number of receive / transmit queues:

Intel82599 supports 16 receive and 16 transmit queues, 0-15
Intel1g i350 supports 8 receive and 8 transmit queues, 0-7
Intel1g i210 supports 4 receive and 4 transmit queues, 0-3

The Intel82599 supports both VMDq and RSS with 32/64 pools and 4/2 RSS queues for each pool. Intel1g i350 supports both VMDq and RSS with 8 pools and 2 queues for each pool. Intel1g i210 does not support VMDq.

Solarflare Ethernet Controller Apps

Solarflare (apps.solarflare.solarflare)

The Solarflare app drives one port of a Solarflare SFN7 Ethernet controller. Multiple instances of the Solarflare app can be instantiated on the same PCI device. Packets received from the network will be dispatched between apps based on destination MAC address and VLAN. Packets taken from the rx port are transmitted onto the network. Packets received from the network are put on the tx port.

The Solarflare app requires OpenOnload version 201502 to be installed and the sfc module to be loaded.

Configuration

The Solarflare app accepts a table as its configuration argument. The following keys are defined:

— Key pciaddr

Required. The PCI address of the NIC as a string.

— Key macaddr

Optional. The MAC address to use as a string. The default is a wild-card (i.e. accept all packets).

— Key vlan

Optional. A twelve bit integer (0-4095). If set, incoming packets from other VLANs are dropped and outgoing packets are tagged with a VLAN header.

RateLimiter App (apps.rate_limiter.rate_limiter)

The RateLimiter app implements a token bucket algorithm with a single bucket and drops non-conforming packets. It receives packets on the input port and transmits conforming packets to the output port.

— Method RateLimiter:snapshot

Returns throughput statistics in the form of a table with the following fields:

rx - Number of packets received
tx - Number of packets transmitted
time - Current time in nanoseconds

Configuration

The RateLimiter app accepts a table as its configuration argument. The following keys are defined:

— Key rate

Required. Rate in bytes per second to which throughput should be limited.

— Key bucket_capacity

Required. Bucket capacity in bytes. Should be equal to or greater than rate. Otherwise the effective rate may be limited.

— Key initial_capacity

Optional. Initial bucket capacity in bytes. Defaults to bucket_capacity.
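
How the three keys interact can be illustrated with a plain Lua sketch of a single token bucket. All names in it are hypothetical and time is passed in explicitly, in seconds; it shows the general technique and is not the code of the RateLimiter app.

```lua
-- Sketch only: single token bucket that drops non-conforming packets.
local function new_bucket (rate, bucket_capacity, initial_capacity)
   return { rate = rate, capacity = bucket_capacity,
            tokens = initial_capacity or bucket_capacity, last = 0 }
end

-- Returns true if a packet of length bytes conforms at time now.
local function conforms (bucket, length, now)
   -- Refill at rate bytes per second, capped at the bucket capacity.
   local refill = (now - bucket.last) * bucket.rate
   bucket.tokens = math.min(bucket.capacity, bucket.tokens + refill)
   bucket.last = now
   if length <= bucket.tokens then
      bucket.tokens = bucket.tokens - length
      return true    -- forward the packet
   end
   return false      -- drop the packet
end

local b = new_bucket(1000, 1500, 1500)
print(conforms(b, 1000, 0))    -- true:  1500 tokens, 500 left
print(conforms(b, 1000, 0))    -- false: only 500 tokens
print(conforms(b, 1000, 0.5))  -- true:  refilled to 1000 tokens
```

The capacity bounds the size of a burst, while rate bounds the long-term average.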

Performance

The RateLimiter app is able to process more than 20 Mpps per CPU core. Refer to its selftest for details.

PcapFilter App (apps.packet_filter.pcap_filter)

The PcapFilter app receives packets on the input port and transmits conforming packets to the output port. In order to conform, a packet must match the pcap-filter expression of the PcapFilter instance and/or belong to a sanctioned connection. For a connection to be sanctioned it must be tracked in a state table by a PcapFilter app using the same state table. All PcapFilter apps share a global namespace of state table identifiers. Multiple PcapFilter apps (e.g. for inbound and outbound traffic) can refer to the same connection by sharing a state table identifier.

Configuration

The PcapFilter app accepts a table as its configuration argument. The following keys are available:

— Key filter

Required. A string containing a pcap-filter expression.

— Key state_table

Optional. A string naming a state table. If set, packets passing any rule will be tracked in the specified state table and any packet that belongs to a tracked connection in the specified state table will be allowed to pass.
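
A pair of filters that share a state table might be configured roughly as follows. This is a sketch only: the app names, the filter expressions and the state table name "web" are arbitrary examples, and the fragment has to run inside a Snabb process.

```lua
-- Sketch only: names and filter expressions are examples.
local config = require("core.config")
local PcapFilter = require("apps.packet_filter.pcap_filter").PcapFilter

local c = config.new()
-- Outbound: packets to TCP port 80 pass and are tracked in "web".
config.app(c, "outbound", PcapFilter,
           {filter = "tcp dst port 80", state_table = "web"})
-- Inbound: ICMP passes the filter; any other packet passes only if
-- it belongs to a connection tracked in "web".
config.app(c, "inbound", PcapFilter,
           {filter = "icmp", state_table = "web"})
```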

Special Counters

— Key sessions_established

Total number of sessions established.

IPv4 Apps

ARP (apps.ipv4.arp)

The ARP app implements the Address Resolution Protocol, allowing a Snabb network function to automatically learn the next-hop MAC address for outgoing IPv4 traffic. The ARP app will also respond to incoming address resolution requests from other hosts on the same network. The next-hop MAC address may also be statically configured. Finally, the Ethernet source address for all outgoing traffic will be set to the self_mac address configured on the ARP app.

All of this together means that using the ARP app in your network function allows you to forget about link-layer concerns, for IPv4 traffic anyway.

Topologically, the ARP app sits between your network function and the Ethernet interface. Its north link relays traffic to and from the network function; the south link talks instead to the Ethernet interface.

Configuration

The ARP app accepts a table as its configuration argument. The following keys are defined:

— Key self_mac

Optional. The MAC address of this network function. If not provided, a random MAC address will be generated. Two random MAC addresses have a one-in-nine-million chance of colliding. The ARP app will ensure that all outgoing southbound traffic will originate from this MAC address.

— Key self_ip

Required. The IPv4 address of this host; used to respond to requests and when making ARP requests.

— Key next_mac

Optional. The MAC address to which to send all network traffic. This ARP app currently has the limitation that it assumes that all traffic will go to a single MAC address. If this address is provided as part of the configuration, no ARP request will be made; otherwise it will be determined from the next_ip via ARP.

— Key next_ip

Optional. The IPv4 address of the next-hop host. Required only if next_mac is not specified as part of the configuration.

— Key shared_next_mac_key

Optional. Path to a shared memory location (i.e. /var/run/snabb/PID/PATH) in which to store the resolved next_mac. This ARP resolver might be part of a set of peer processes sharing work via RSS. In that case, an ARP response will probably arrive only to one of the RSS processes, not to all of them. If you are using ARP behind RSS, set shared_next_mac_key to, for example, group/arp-next-mac, to enable the different workers to communicate the next-hop MAC address.

Reassembler (apps.ipv4.reassemble)

The Reassembler app is a filter on incoming IPv4 packets that reassembles fragments. Note that Snabb’s internal MTU is 10240 bytes; attempts to reassemble larger packets will fail.

The reassembler has a configurable limit for the reassembly buffer size. If the buffer is full and a new reassembly comes in on the input, the reassembler app will randomly evict a pending reassembly from its buffer before starting the new reassembly.

The reassembler app currently does not time out reassemblies that have been around for too long. It could be a good idea to implement timeouts and then be able to issue “timeout exceeded” ICMP errors if needed.

Finally, note that the reassembler app will pass through any incoming packet that is not IPv4.

Configuration

The Reassembler app accepts a table as its configuration argument. The following keys are defined:

— Key max_concurrent_reassemblies

Optional. The maximum number of concurrent reassemblies. Note that each reassembly uses about 11kB of memory. The default is 20000.

— Key max_fragments_per_reassembly

Optional. The maximum number of fragments per reassembly. The default is 40.
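
As a rough worked example, the memory needed when every reassembly slot is in use follows directly from the figures given for max_concurrent_reassemblies. The sketch takes “about 11kB” as 11,000 bytes, which is an assumption.

```lua
-- Sketch only: worst-case buffer memory with the default settings.
local bytes_per_reassembly = 11 * 1000      -- "about 11kB" (assumed)
local max_concurrent_reassemblies = 20000   -- the default
local total_bytes = bytes_per_reassembly * max_concurrent_reassemblies
print(string.format("%d MB", total_bytes / 1e6))  -- 220 MB
```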

Fragmenter (apps.ipv4.fragment)

The Fragmenter app fragments any IPv4 packets larger than a configured maximum transmission unit (MTU).

Configuration

The Fragmenter app accepts a table as its configuration argument. The following key is defined:

— Key mtu

Required. The maximum transmission unit, in bytes, not including the Ethernet header.

ICMP Echo responder (apps.ipv4.echo)

The ICMPEcho app responds to ICMP echo requests (“pings”) to a given set of IPv4 addresses.

Like the ARP app, ICMPEcho sits between your network function and outside traffic. Its north link relays traffic to and from the network function; the south link talks to the world.

Configuration

The ICMPEcho app accepts a table as its configuration argument. The following keys are defined:

— Key address

Optional. An IPv4 address for which to respond to pings, as a uint8_t[4].

— Key addresses

Optional. An array of IPv4 addresses for which to respond to pings, as a Lua array of uint8_t[4] values.

IPv6 Apps

Nd_light (apps.ipv6.nd_light)

The nd_light app implements a small subset of IPv6 neighbor discovery (RFC4861). It has two duplex ports, north and south. The south port attaches to a network on which neighbor discovery (ND) must be performed. The north port attaches to an app that processes IPv6 packets (including full ethernet frames). Packets transmitted to the north port must be wrapped in full Ethernet frames (which may be empty).

The nd_light app replies to neighbor solicitations for which it is configured as a target and performs rudimentary address resolution for its configured next-hop address. If address resolution succeeds, the Ethernet headers of packets from the north port will be overwritten with headers containing the discovered destination address and the configured source address before they are transmitted over the south port. All packets from the north port are discarded as long as ND has not yet succeeded. Packets received from the south port are transmitted to the north port unaltered.

Configuration

The nd_light app accepts a table as its configuration argument. The following keys are defined:

— Key local_mac

Required. Local MAC address as a string or in binary representation.

— Key remote_mac

Optional. MAC address of next_hop address as a string or in binary representation. If this option is present, the nd_light app does not perform neighbor solicitation for the next_hop address and uses remote_mac as the MAC address associated with next_hop.

— Key local_ip

Required. Local IPv6 address as a string or in binary representation.

— Key next_hop

Required. IPv6 address of next hop as a string or in binary representation.

— Key delay

Optional. Neighbor solicitation retransmission delay in milliseconds. Default is 1,000ms.

— Key retrans

Optional. Number of neighbor solicitation retransmissions. Default is unlimited retransmissions.

— Key quiet

Optional. If set to true, suppress log messages about ND activity. Default is false.

Special Counters

— Key ns_checksum_errors

Neighbor solicitation requests dropped due to invalid ICMP checksum.

— Key ns_target_address_errors

Neighbor solicitation requests dropped due to invalid target address.

— Key na_duplicate_errors

Neighbor advertisement requests dropped because next-hop is already resolved.

— Key na_target_address_errors

Neighbor advertisement requests dropped due to invalid target address.

— Key nd_protocol_errors

Neighbor discovery requests dropped due to protocol errors (invalid IPv6 hop-limit or invalid neighbor solicitation request options).

SimpleKeyedTunnel (apps.keyed_ipv6_tunnel.tunnel)

The SimpleKeyedTunnel app implements “a simple L2 Ethernet over IPv6 tunnel encapsulation” as described in Keyed IPv6 Tunnel. It has two duplex ports, encapsulated and decapsulated. Packets transmitted on the decapsulated input port will be encapsulated and put on the encapsulated output port. Packets transmitted on the encapsulated input port will be decapsulated and put on the decapsulated output port.

Configuration

The SimpleKeyedTunnel app accepts a table as its configuration argument. The following keys are defined:

— Key local_address

Required. Local IPv6 address as a string.

— Key remote_address

Required. Remote IPv6 address as a string.

— Key local_cookie

Required. Local cookie, 8 bytes encoded in a hexadecimal string.

— Key remote_cookie

Required. Remote cookie, 8 bytes encoded in a hexadecimal string.

— Key local_session

Optional. Unsigned integer, 32 bit. If set, the session_id field of the L2TPv3 header will be overwritten with this value.

— Key hop_limit

Optional. Unsigned integer. Sets the hop limit. Default is 64.

— Key default_gateway_MAC

Optional. Destination MAC as a string. Not required if overwritten by an app such as nd_light.

Special Counters

— Key length_errors

Ingress packets dropped due to invalid length (packet too short).

— Key protocol_errors

Ingress packets dropped due to unrecognized IPv6 protocol ID.

— Key cookie_errors

Ingress packets dropped due to wrong cookie value.

— Key remote_address_errors

Ingress packets dropped due to wrong remote IPv6 endpoint address.

— Key local_address_errors

Ingress packets dropped due to wrong local IPv6 endpoint address.

Fragmenter (apps.ipv6.fragment)

The Fragmenter app will fragment any IPv6 packets larger than a configured maximum transmission unit (MTU) or the dynamically discovered MTU on the network path (PMTU) towards a specific destination, depending on the setting of the pmtud configuration option.

If path MTU discovery (PMTUD) is disabled, the app expects to receive packets on its input link and sends (possibly fragmented) packets to its output link.

If PMTUD is enabled, the app also expects to process packets in the reverse direction in order to be able to intercept and interpret ICMP packets of type 2, code 0. Those packets, known as “Packet Too Big” (PTB) messages, contain reports from nodes on the path towards a particular destination, which indicate that a previously sent packet could not be forwarded due to an MTU bottleneck. The message contains the MTU in question as well as at least the header of the original packet that triggered the PTB message. The Fragmenter app extracts the destination address from the original packet and stores the MTU in a per-destination cache as the PMTU for that address.

Apart from checking the integrity of the ICMP message, the app can optionally also verify whether the message is actually intended for consumption by this instance of the Fragmenter app. For that purpose, the app can be configured with an exhaustive list of IPv6 addresses that are designated to be local to the system. When a PTB message is received, it is checked whether the destination address of the ICMP message as well as the source address of the embedded original packet are contained in this list. The message is discarded if this condition is not met. No such checking is performed if the list is empty.

When the Fragmenter receives a packet on the input link, it first consults the per-destination cache. In case of a hit, the PMTU from the cache takes precedence over the statically configured MTU.

A PMTU is removed from the cache after a configurable timeout to allow the system to discover a larger PMTU, e.g. after a change in network topology.

With PMTUD enabled, the app has two additional links, called north and south.

All packets received on the south link which are not ICMP packets of type 2, code 0 are passed on unmodified on the north link.
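
The per-destination cache described above can be illustrated with a plain Lua sketch. All names in it are hypothetical, destinations are represented as strings, and time is passed in explicitly, in seconds. It shows the lookup and expiry logic only and is not the code of the Fragmenter app.

```lua
-- Sketch only: per-destination PMTU cache with expiry.
local function new_cache (mtu, pmtu_timeout)
   return { mtu = mtu, timeout = pmtu_timeout, entries = {} }
end

-- Record the MTU reported by a Packet Too Big message for dst.
local function record_ptb (cache, dst, pmtu, now)
   cache.entries[dst] = { pmtu = pmtu, expires = now + cache.timeout }
end

-- Return the MTU to use when fragmenting a packet sent to dst.
local function effective_mtu (cache, dst, now)
   local entry = cache.entries[dst]
   if entry and now >= entry.expires then
      cache.entries[dst] = nil   -- timed out, forget the PMTU
      entry = nil
   end
   -- A cached PMTU takes precedence over the configured MTU.
   return entry and entry.pmtu or cache.mtu
end

local cache = new_cache(1500, 600)
record_ptb(cache, "2001:db8::1", 1280, 0)
print(effective_mtu(cache, "2001:db8::1", 10))   -- 1280
print(effective_mtu(cache, "2001:db8::2", 10))   -- 1500
print(effective_mtu(cache, "2001:db8::1", 600))  -- 1500
```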

Configuration

The Fragmenter app accepts a table as its configuration argument. The following keys are defined:

— Key mtu

Required. The maximum transmission unit, in bytes, not including the Ethernet header.

— Key pmtud

Optional. If set to true, dynamic path MTU discovery (PMTUD) is enabled. The default is false.

— Key pmtu_timeout

Optional. The amount of time in seconds after which a PMTU is removed from the cache. The default is 600. This key is ignored unless pmtud is true.

— Key pmtu_local_addresses

Optional. A table of IPv6 addresses in human readable representation for which the app will accept PTB messages. The default is an empty table, which disables the check for local addresses.

ICMP Echo responder (apps.ipv6.echo)

The ICMPEcho app responds to ICMP echo requests (“pings”) to a given set of IPv6 addresses.

Like the ARP app, ICMPEcho sits between your network function and outside traffic. Its north link relays traffic to and from the network function; the south link talks to the world.

Configuration

The ICMPEcho app accepts a table as its configuration argument. The following keys are defined:

— Key address

Optional. An IPv6 address for which to respond to pings, as a uint8_t[16].

— Key addresses

Optional. An array of IPv6 addresses for which to respond to pings, as a Lua array of uint8_t[16] values.

VhostUser App (apps.vhost.vhost_user)

The VhostUser app implements portions of the Virtio protocol for virtual ethernet I/O interfaces. In particular, VhostUser supports the virtio vring data structure for packet I/O in shared memory (DMA) and the Linux vhost API for creating vrings attached to tuntap devices.

With VhostUser, SnabbSwitch can be used as a virtual ethernet interface by QEMU virtual machines. When connected via a UNIX socket, packets can be sent to the virtual machine by transmitting them on the rx port and packets sent by the virtual machine will arrive on the tx port.

Configuration

The VhostUser app accepts a table as its configuration argument. The following keys are defined:

— Key socket_path

Optional. A string denoting the path to the UNIX socket to connect on. If it is not given, all incoming packets will be dropped.

— Key is_server

Optional. Listen and accept an incoming connection on socket_path instead of connecting to it.

VirtioNet App (apps.virtio_net.virtio_net)

The VirtioNet app implements a subset of the driver part of the virtio-net specification. It can connect to a virtio-net device from within a QEMU virtual machine. Packets can be sent out of the virtual machine by transmitting them on the rx port, and packets sent to the virtual machine will arrive on the tx port.

Configuration

The VirtioNet app accepts a table as its configuration argument. The following keys are defined:

— Key pciaddr

Required. The PCI address of the virtio-net device.

— Key use_checksum

Optional. Boolean value to enable the checksum offloading pre-calculations applied on IPv4/IPv6 TCP and UDP packets.

Pcap Savefile Apps

PcapReader and PcapWriter Apps (apps.pcap.pcap)

The PcapReader and PcapWriter apps can be used to inject and log raw packet data into and out of the app network using the Libpcap File Format. PcapReader reads raw packets from a PCAP file and transmits them on its output port while PcapWriter writes packets received on its input port to a PCAP file.

PcapReader

Configuration

Both PcapReader and PcapWriter expect a filename string as their configuration arguments to read from and write to respectively. PcapWriter will alternatively accept an array as its configuration argument, with the first element being the filename and the second element being a mode argument to io.open.

Tap (apps.pcap.tap)

The Tap app is a simple in-band packet tap that writes packets that it sees to a pcap savefile. It can optionally write only packets that pass a pcap filter, and optionally subsample so that it writes only every nth packet.

pcaptap

Configuration

The Tap app accepts a table as its configuration argument. The following keys are defined:

— Key filename

Required. The name of the file to which to write the packets.

— Key mode

Optional. Either "truncate" or "append", indicating whether the savefile will be truncated (the default) or appended to.

— Key filter

Optional. A pflang filter expression to select packets for tapping. Only packets that pass this filter will be sampled for the packet tap.

— Key sample

Optional. A sampling period. Defaults to 1, indicating that every packet seen by the tap and passing the optional filter string will be written. Setting this value to 2 will capture every second packet, and so on.
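
How filter and sample interact can be sketched in plain Lua. This is an illustration, not the app's code: make_tap is a hypothetical helper, the predicate stands in for a compiled pflang filter, and which packet within each period gets written is an assumption.

```lua
-- Hypothetical sketch: decide which packets a tap would write.
-- `passes` stands in for a compiled pflang filter (may be nil);
-- `period` corresponds to the `sample` key.
local function make_tap (passes, period)
   local seen = 0   -- packets that passed the filter so far
   return function (pkt)
      if passes and not passes(pkt) then return false end
      seen = seen + 1
      -- Write one packet per sampling period (period 1 writes all).
      return seen % period == 0
   end
end

-- Ten UDP packets through a tap with sample = 2.
local tap = make_tap(function (p) return p.proto == "udp" end, 2)
local written = 0
for _ = 1, 10 do
   if tap({proto = "udp"}) then written = written + 1 end
end
```

Packets rejected by the filter do not advance the sampling counter, which matches the description that only packets passing the filter are sampled.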

RawSocket App (apps.socket.raw)

The RawSocket app is a bridge between Linux network interfaces (eth0, lo, etc.) and a Snabb app network. Packets taken from the rx port are transmitted over the selected interface. Packets received on the interface are put on the tx port.

RawSocket

Configuration

The RawSocket app accepts a string as its configuration argument. The string denotes the interface to bridge to.

UnixSocket App (apps.socket.unix)

The UnixSocket app provides I/O for a named Unix socket.

UnixSocket

Configuration

The UnixSocket app takes a string argument which denotes the Unix socket file name to open, or a table with the fields:

filename - the Unix socket file name to open.
listen - if true, listen for incoming connections on the socket rather than connecting to the socket in client mode.
mode - can be “stream” or “packet” (the default is “stream”): the difference is that in packet mode, the packets are not split or merged (in both modes packets arrive in order).

NOTE: The socket is not opened until the first call to push() or pull(). If the connection is lost, the socket will be re-opened on the next call to push() or pull().

Tap app (apps.tap.tap)

The Tap app is used to interact with a Linux tap device. Packets transmitted on the input port will be sent over the tap device, and packets that arrive on the tap device can be received on the output port.

Tap

Configuration

This app accepts either a single string or a table as its configuration argument. A single string is equivalent to the default configuration with the name attribute set to the string.

— Key name

Required. The name of the tap device.

If the device does not exist yet, which is inferred from the absence of the directory /sys/class/net/name, it will be created by the app and removed when the process terminates. Such a device is called ephemeral and its operational state is set to up after creation.

If the device already exists, it is called persistent. The app can attach to a persistent tap device and detaches from it when it terminates. The operational state is not changed. By default, the MTU is also not changed by the app, see the mtu_set option below.

One way to create a persistent tap device is with the ip tool:

ip tuntap add Tap345 mode tap
ip link set up dev Tap345
ip link set address 02:01:02:03:04:08 dev Tap345

— Key mtu

Optional. The L2 MTU of the device. The default is 1514.

By definition, the L2 MTU includes the size of the L2 header, e.g. 14 bytes in case of Ethernet without VLANs. However, the Linux ioctl methods only expose the L3 (IP) MTU, which does not include the L2 header. The following configuration options are used to correct this discrepancy.

— Key mtu_fixup

Optional. A boolean that indicates whether the mtu option should be corrected for the difference between the L2 and L3 MTU. The default is true.

— Key mtu_offset

Optional. The value by which the mtu is reduced when mtu_fixup is set to true. The default is 14.

The resulting MTU is called the effective MTU.

— Key mtu_set

Optional. Either nil or a boolean that indicates whether the MTU of the tap device should be set or checked. If mtu_set is true, the MTU of the tap device is set to the effective MTU. If mtu_set is false, the effective MTU is compared with the current value of the MTU of the tap device and an error is raised in case of a mismatch.

If mtu_set is nil, the MTU is set or checked if the tap device is ephemeral or persistent, respectively. The rationale is that if the device is persistent, the entity that created the device is responsible for the configuration and might not expect or react well to a change of the MTU.
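
The relationship between mtu, mtu_fixup, mtu_offset and mtu_set can be summarized in a short plain-Lua sketch. The helper names effective_mtu and mtu_action are hypothetical; only the defaults and rules stated above are assumed.

```lua
-- Hypothetical helper: compute the effective MTU from a configuration
-- table, using the documented defaults.
local function effective_mtu (conf)
   local mtu    = conf.mtu or 1514
   local offset = conf.mtu_offset or 14
   local fixup  = conf.mtu_fixup
   if fixup == nil then fixup = true end
   if fixup then return mtu - offset end
   return mtu
end

-- Hypothetical helper: decide whether the MTU is set on the device or
-- only checked. `ephemeral` is true if the app created the device.
local function mtu_action (mtu_set, ephemeral)
   if mtu_set == nil then
      return ephemeral and "set" or "check"
   end
   return mtu_set and "set" or "check"
end
```

With the defaults, an L2 MTU of 1514 yields an effective MTU of 1500, the usual IP MTU of an Ethernet link.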

VLAN Apps

There are three VLAN related apps, Tagger, Untagger and VlanMux. The Tagger and Untagger apps add or remove a VLAN tag whereas the VlanMux app can multiplex and demultiplex packets to different output ports based on tag.

Tagger (apps.vlan.vlan)

The Tagger app adds a VLAN tag, with the configured value and encapsulation, to packets received on its input port and transmits them on its output port.

Configuration

— Key encapsulation

Optional. The Ethertype to use as encapsulation for the VLAN. Permitted values are the strings “dot1q” and “dot1ad” or a number to select an arbitrary Ethertype. “dot1q” and “dot1ad” correspond to the Ethertypes 0x8100 and 0x88a8, respectively, according to the IEEE standards 802.1Q and 802.1ad.

If a number is given, it is truncated to 16 bits. This feature is intended to allow interoperation with vendors that do not use one of the standard encapsulations (a prominent example being the value 0x9100, which can still be found in practice for double-tagging instead of 0x88a8).

The default is “dot1q”.

— Key tag

Required. VLAN tag to add to the packet. The value must be a number in the range 1-4094 (inclusive).
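
The following plain-Lua sketch shows how the two keys could translate into the four bytes of a VLAN tag (the 16-bit Ethertype followed by the 16-bit tag control field). The function names are hypothetical, and the priority and DEI bits are simply left at zero here, which is an assumption.

```lua
local encapsulations = { dot1q = 0x8100, dot1ad = 0x88a8 }

-- Resolve the `encapsulation` key to a 16-bit Ethertype.
local function resolve_ethertype (encapsulation)
   if encapsulation == nil then return encapsulations.dot1q end
   if type(encapsulation) == "string" then
      return assert(encapsulations[encapsulation], "unknown encapsulation")
   end
   return encapsulation % 0x10000   -- numbers are truncated to 16 bits
end

-- Build the four tag bytes in network byte order.
local function build_tag (encapsulation, tag)
   assert(tag >= 1 and tag <= 4094, "VLAN tag out of range")
   local ethertype = resolve_ethertype(encapsulation)
   return string.char(math.floor(ethertype / 256), ethertype % 256,
                      math.floor(tag / 256), tag % 256)
end
```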

Untagger (apps.vlan.vlan)

The Untagger app checks packets received on its input port for a VLAN tag. If the tag matches the configured VLAN tag, it is removed and the packet is transmitted on the output port. Packets with VLAN tags other than the configured tag are dropped.

Configuration

— Key encapsulation

Optional. See above.

— Key tag

Required. VLAN tag to remove from the packet. The value must be a number in the range 1-4094 (inclusive).

VlanMux (apps.vlan.vlan)

Despite its name, the VlanMux app acts as both a multiplexer and a demultiplexer. As a multiplexer, it receives packets from multiple input ports, adds a VLAN tag and transmits them on its trunk port. As a demultiplexer, it receives packets from its trunk port and distributes them over multiple output ports based on the VLAN tag of each received packet. It supports the notion of a “native VLAN” by mapping untagged frames on the trunk port to a dedicated output port.

A packet received on its trunk input port must either be untagged or tagged with the encapsulation as specified with the encapsulation configuration option. Otherwise, the packet is dropped.

If the Ethernet frame is tagged, the VLAN ID is extracted and the packet is transmitted on the port named vlan<vid>, where <vid> is the decimal representation of the VLAN ID. If no such port exists, the packet is dropped.

If the Ethernet frame is untagged, it is transmitted on the port named native or dropped if no such port exists.

A packet received on a port named vlan<vid> is tagged with the VLAN ID <vid> according to the configured encapsulation and transmitted on the trunk port.

A packet received on the port named native is transmitted as is on the trunk port.
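
The port naming rules for the demultiplexing direction can be sketched in plain Lua. The function trunk_output_port is a hypothetical helper; it only models the name lookup, not packet parsing.

```lua
-- Map a frame arriving on the trunk port to the name of an output port.
-- `vid` is the VLAN ID, or nil for an untagged frame; `ports` is the
-- set of output port names that exist on the app.
local function trunk_output_port (vid, ports)
   local name = vid and string.format("vlan%d", vid) or "native"
   if ports[name] then return name end
   return nil   -- no such port: the packet is dropped
end

local ports = { vlan100 = true, native = true }
```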

Configuration

— Key encapsulation

Optional. See above.

Bridge Apps

A bridge app implements a basic Ethernet bridge with split-horizon semantics. It has an arbitrary number of ports. For each input port there must exist an output port with the same name. Each port name is a member of at most one split-horizon group. If it is not a member of a split-horizon group, the port is also called a free port. Packets arriving on a free input port may be forwarded to all other output ports. Packets arriving on an input port that belongs to a split-horizon group are never forwarded to any output port belonging to the same split-horizon group. There are two bridge implementations available: apps.bridge.flooding and apps.bridge.learning.

bridge

Configuration

A bridge app accepts a table as its configuration argument. The following keys are defined:

— Key ports

Optional. An array of free port names. The default is no free ports.

— Key split_horizon_groups

Optional. A table mapping split-horizon groups to arrays of port names. The default is no split-horizon groups.

— Key config

Optional. The configuration of the actual bridge implementation.
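
To make the split-horizon rule concrete, here is a plain-Lua sketch that takes a configuration table of the shape described above and lists the output ports a packet may be forwarded to. The port and group names are made up, and scope is a hypothetical helper, not part of the bridge API.

```lua
-- Example configuration: one free port and one split-horizon group.
local conf = {
   ports = { "uplink" },
   split_horizon_groups = { mesh = { "pw1", "pw2" } },
}

-- List the output ports that a packet arriving on `in_port` may be
-- forwarded to.
local function scope (conf, in_port)
   local group_of, all = {}, {}
   for _, name in ipairs(conf.ports or {}) do all[#all + 1] = name end
   for group, names in pairs(conf.split_horizon_groups or {}) do
      for _, name in ipairs(names) do
         group_of[name] = group
         all[#all + 1] = name
      end
   end
   local out = {}
   for _, name in ipairs(all) do
      local same_group = group_of[in_port] ~= nil
         and group_of[in_port] == group_of[name]
      if name ~= in_port and not same_group then out[#out + 1] = name end
   end
   table.sort(out)
   return out
end
```

A packet arriving on the free port uplink may go to pw1 and pw2, while a packet arriving on pw1 may only go to uplink.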

Flooding bridge (apps.bridge.flooding)

The flooding bridge app implements the simplest possible bridge, which floods a packet arriving on an input port to all output ports within its scope according to the split-horizon topology.

Configuration

The flooding bridge app ignores the config key of its configuration.

Learning bridge (apps.bridge.learning)

The learning bridge app implements a learning bridge using a custom hash table to store the set of MAC source addresses of packets arriving on each input port. When a packet is received it is forwarded to all output ports whose corresponding input ports match the packet’s destination MAC address. When no input port matches, the packet is flooded to all output ports. Multicast MAC addresses are always flooded to all output ports associated with the input port. The scoping rules according to the split-horizon topology apply unchanged.

Configuration

The learning bridge app accepts a table as the value of the config key of its configuration. The following keys are defined:

— Key mac_table

Optional. A table that defines the characteristics of the MAC table. The following keys are defined:

— Key size

Optional. The number of MAC addresses to be stored in the table. Default is 256. The size of the table is increased automatically if this limit is reached or if an overflow in one of the hash buckets occurs. This value is capped by resize_max.

— Key timeout

Optional. Timeout for learned MAC addresses in seconds. Default is 60.

— Key verbose

Optional. A boolean value. If true, statistics about table usage are logged during each timeout interval. Default is false.

— Key copy_on_resize

Optional. A boolean value. If true, the contents of the table are copied to the newly allocated table after a resize operation. Default is true.

— Key resize_max

Optional. An upper bound for the size of the table. Default is 65536.
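
Put together, a configuration table for the learning bridge might look like the sketch below. The port and group names and the chosen values are illustrative only; the nesting follows the key descriptions above.

```lua
-- Illustrative configuration for apps.bridge.learning.
local bridge_config = {
   ports = { "uplink" },
   split_horizon_groups = { mesh = { "pw1", "pw2" } },
   config = {
      mac_table = {
         size = 1024,            -- initial capacity, capped by resize_max
         timeout = 120,          -- seconds
         verbose = false,
         copy_on_resize = true,
         resize_max = 16384,
      },
   },
}
```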

IPFIX and NetFlow apps

IPFIX (apps.ipfix.ipfix)

The IPFIX app implements an RFC 7011 IPFIX “meter” and “exporter” that records the flows present in incoming traffic and sends exported UDP packets describing those flows to an external collector (not included). The exporter can produce output in either the standard RFC 7011 IPFIX format, or the older NetFlow v9 format from RFC 3954.

IPFIX

See the snabb ipfix probe command-line interface for a program built using this app.

Configuration

The IPFIX app accepts a table as its configuration argument. The following keys are defined:

— Key idle_timeout

Optional. Number of seconds after which a flow should be considered idle and available for expiry. The default is 300 seconds.

— Key active_timeout

Optional. Period at which an active, non-idle flow should produce export records. The default is 120 seconds.

— Key cache_size

Optional. Initial size of flow tables, in terms of number of flows. The default is 20000.

— Key template_refresh_interval

Optional. Period at which to send template records over UDP. The default is 600 seconds.

— Key ipfix_version

Optional. Version of IPFIX to export. 9 indicates legacy NetFlow v9; 10 indicates RFC 7011 IPFIX. The default is 10.

— Key mtu

Optional. MTU for exported UDP packets. The default is 512.

— Key observation_domain

Optional. Observation domain tag to attach to all exported packets. The default is 256.

— Key exporter_ip

Required, sadly. The IPv4 address from which to send exported UDP packets.

— Key collector_ip

Required. The IPv4 address to which to send exported UDP packets.

— Key collector_port

Required. The port on which the collector is listening for UDP packets.

— Key templates

Optional. The templates for flows being collected. See the source code for more information.
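
A configuration table using these keys might look like the following sketch. The addresses come from the 192.0.2.0/24 documentation range and are placeholders; 4739 is the IANA-registered IPFIX port, used here only as an example value.

```lua
-- Illustrative configuration table for the IPFIX app.
local ipfix_config = {
   exporter_ip    = "192.0.2.1",   -- placeholder address
   collector_ip   = "192.0.2.2",   -- placeholder address
   collector_port = 4739,
   ipfix_version  = 10,            -- 9 would select NetFlow v9
   idle_timeout   = 300,           -- seconds
   active_timeout = 120,           -- seconds
   mtu            = 512,
}
```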

To-do list

Some ideas for things to hack on are below.

Limit the number of flows

As it is, if an attacker can create millions of flows, then our flow set will expand to match (and never shrink). Perhaps we should cap the total size of the flow table.

Look up multiple keys in parallel

For large ctables, we can only do 7 or 8 million lookups per second if we look up one key after another. However if we do lookups in parallel, then we can get 15 million or so, which would allow us to reach 10Gbps line rate on 64-byte packets.

YANG schema to define IPFIX app configuration

We should try to model the configuration of the IPFIX app with a YANG schema. See RFC 6728 for some inspiration.

Use special-purpose internal links

The links that we use as internal buffers between parts of the IPFIX app have some overhead as they have to update counters. Perhaps we should use a special-purpose data structure.

Use a monotonic timer

Currently internal flow start and end times use UNIX time. This isn’t great for timers, but it does match what’s specified in RFC 7011. Could we switch to monotonic time?

Allow export to IPv6 collectors

We can collect IPv6 flows of course, but we only export to collectors over IPv4 for the moment.

Allow packets to count towards multiple templates

Right now, routing a packet towards a flow set means no other flow set can measure that packet. Perhaps this should change.

IPsec Apps

ESP Transport6 and Tunnel6 (apps.ipsec.esp)

The Transport6 and Tunnel6 apps implement ESP in transport and tunnel mode respectively. They encrypt packets received on their decapsulated port and transmit them on their encapsulated port, and vice versa. Packets arriving on the decapsulated port must have Ethernet and IPv6 headers, and packets arriving on the encapsulated port must have Ethernet and IPv6 headers followed by an ESP header; otherwise they will be discarded.

Transport6

References:

lib.ipsec.esp

Configuration

The Transport6 and Tunnel6 apps accept a table as their configuration argument. The following keys are defined:

— Key self_ip (Tunnel6 only)

Required. Source address of the encapsulating IPv6 header.

— Key nexthop_ip (Tunnel6 only)

Required. Destination address of the encapsulating IPv6 header.

— Key aead

Optional. The identifier of the AEAD to use for encryption and authentication. For now, only the default "aes-gcm-16-icv" (AES-GCM with a 16 octet ICV) is supported.

— Key spi

Required. A 32 bit integer denoting the “Security Parameters Index” as specified in RFC 4303.

— Key transmit_key

Required. Hexadecimal string of 32 digits (two digits for each byte) that denotes a 128-bit AES key as specified in RFC 4106 used for the encryption of outgoing packets.

— Key transmit_salt

Required. Hexadecimal string of eight digits (two digits for each byte) that denotes four bytes of salt as specified in RFC 4106 used for the encryption of outgoing packets.

— Key receive_key

Required. Hexadecimal string of 32 digits (two digits for each byte) that denotes a 128-bit AES key as specified in RFC 4106 used for the decryption of incoming packets.

— Key receive_salt

Required. Hexadecimal string of eight digits (two digits for each byte) that denotes four bytes of salt as specified in RFC 4106 used for the decryption of incoming packets.

— Key receive_window

Optional. Minimum width of the window in which out of order packets are accepted as specified in RFC 4303. The default is 128.

— Key resync_threshold

Optional. Number of consecutive packets allowed to fail decapsulation before attempting “Re-synchronization” as specified in RFC 4303. The default is 1024.

— Key resync_attempts

Optional. Number of attempts to re-synchronize a packet that triggered “Re-synchronization” as specified in RFC 4303. The default is 8.

— Key auditing

Optional. A boolean value indicating whether to enable or disable “Auditing” as specified in RFC 4303. The default is nil (no auditing).
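
The key and salt formats are easy to get wrong, so a small plain-Lua check can help before handing a table to the app. The helpers is_hex and check_esp_keys are hypothetical, and the key material below is a placeholder that must never be used for real traffic.

```lua
-- True if `s` is a string of exactly `digits` hexadecimal digits.
local function is_hex (s, digits)
   return type(s) == "string" and #s == digits
      and s:match("^%x+$") ~= nil
end

-- Check the documented lengths: 32 digits for keys, 8 for salts.
local function check_esp_keys (conf)
   assert(is_hex(conf.transmit_key, 32), "transmit_key: need 32 hex digits")
   assert(is_hex(conf.transmit_salt, 8), "transmit_salt: need 8 hex digits")
   assert(is_hex(conf.receive_key, 32), "receive_key: need 32 hex digits")
   assert(is_hex(conf.receive_salt, 8), "receive_salt: need 8 hex digits")
   return true
end

-- Placeholder values for illustration only.
local esp_config = {
   spi = 0x1001,
   transmit_key  = "00112233445566778899aabbccddeeff",
   transmit_salt = "00112233",
   receive_key   = "ffeeddccbbaa99887766554433221100",
   receive_salt  = "33221100",
}
```

On the peer, the transmit and receive values would be swapped so that each side decrypts with the key the other side encrypts with.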

Test Apps

Match (apps.test.match)

The Match app compares packets received on its input port rx with those received on the reference input port comparator, and reports mismatches as well as packets from comparator that were not matched.

Match

— Method Match:errors

Returns the recorded errors as an array of strings.

Configuration

The Match app accepts a table as its configuration argument. The following keys are defined:

— Key fuzzy

Optional. If this key is true, packets from rx that do not match the next packet from comparator are ignored. The default is false.

— Key modest

Optional. If this key is true, unmatched packets from comparator are ignored if at least one packet from rx was successfully matched. The default is false.

Synth (apps.test.synth)

The Synth app generates synthetic packets with Ethernet headers and alternating payload sizes. On each breath it fills each attached output link with new packets.

Synth

Configuration

The Synth app accepts a table as its configuration argument. The following keys are defined:

— Key src

— Key dst

Source and destination MAC addresses in human-readable form. The default is "00:00:00:00:00:00".

— Key sizes

An array of numbers designating the packet payload sizes. The default is {64}.

— Key random_payload

Generate a random payload for each packet in sizes.

— Key packet_id

Insert the packet number (32bit uint) directly after the ethertype. The packet number starts at 0 and is sequential on each output link.

Npackets (apps.test.npackets)

The Npackets app allows at most N packets to flow through it. Any further packets are never dequeued from input.

         +----------+
input -> | Npackets | -> output
         +----------+

Configuration

The Npackets app accepts a table as its configuration argument. The following keys are defined:

— Key npackets

The number of packets to forward. Further packets are never dequeued from input.

SnabbWall Apps

L7Spy (apps.wall.l7spy)

L7Spy

The L7Spy app is a Snabb app that scans packets passing through it using an instance of the Scanner class. The scanner instance may be shared among several L7Spy instances or with an L7Fw app for filtering.

— Method L7Spy:new config

Construct a new L7Spy app instance based on a given configuration table. The table may contain the following key:

scanner (optional): Either a string identifying the kind of scanner to construct (currently only "ndpi" is accepted) or an existing scanner instance.

Filter (apps.wall.filter)

L7Fw

The L7Fw app implements a stateful firewall by querying the scanner state collected by an L7Spy app. It then filters packets based on a given set of rules.

— Method L7Fw:new config

Construct a new L7Fw app instance based on a given configuration table. The table may contain the following keys:

scanner: A Scanner instance shared with an L7Spy instance. The metadata in this scanner is used for packet filtering.
rules: A table mapping protocol names (as strings) to firewall actions. The accepted actions are "accept", "reject", "drop", or a pfmatch expression. The pfmatch expression may use the variable flow_count (as an arithmetic expression) to refer to the number of packets in a given protocol flow, and may call the accept, reject, or drop methods.
local_ipv4 (optional): An IPv4 address that identifies the host running the firewall. This is used as the source address in ICMPv4 or TCP reject responses.
local_ipv6 (optional): An IPv6 address that identifies the host running the firewall. This is used as the source address in ICMPv6 or TCP reject responses.
local_macaddr (optional): A MAC address that identifies the host running the firewall. This is used for the source address in ethernet frames for reject responses.
logging (optional): A log level parameter that can be set to “on” or “off”. When set to “on”, it will report dropped/rejected packets to the system log.

Scanner (apps.wall.scanner)

Scanner objects are responsible for:

Identifying traffic flows.
Analyzing the contents of packets to determine which application they belong to.
Keeping enough state to be able to enumerate the identified traffic flows, and identify the application they belong to.

The class is not meant to be instantiated directly, but to be used as the basis for concrete implementations (e.g. NdpiScanner). It provides one function that subclasses can use:

— Method Scanner:extract_packet_info packet

Extracts fields from the headers of an IPv4 or IPv6 packet. The returned values are:

A key object (more on this below) which uniquely identifies a traffic flow.
The offset to the packet payload content.
The source_address of the packet, as an array of bytes.
The source_port, for UDP and TCP packets.
The destination_address of the packet, as an array of bytes.
The destination_port, for UDP and TCP packets.

Key objects contain some of the returned information in a compact FFI representation, and can be used as an aid to uniquely identify a flow of packets. They provide the following attributes:

:eth_type(): Method which returns the type of the Ethernet frame payload, either ETH_TYPE_IPv4 or ETH_TYPE_IPv6.
:hash(): Method which returns an integer calculated by hashing all the other values in the key object.
.vlan_id: VLAN identifier. Zero for no VLAN tags.
.ip_proto: The IP protocol.
.lo_addr and .hi_addr: IP addresses (either v4 or v6).
.lo_port and .hi_port: For TCP and UDP, the ports as big-endian (network) integers.

This method can be very useful to implement scanners using backends which do not implement their own flow classification.

Subclassing

All the Scanner implementations conform to the Scanner base API.

— Method Scanner:scan_packet packet, time

Scans a packet.

The time parameter gives the time (in seconds from the Epoch) at which the packet was received for processing. A suitable value can be obtained using engine.now().

— Method Scanner:get_flow packet

Obtains the traffic flow for a given packet. If the packet is determined to not match any of the detected flows, nil is returned. The returned flow object has at least the following fields:

protocol: The L7 protocol for the flow. A user-visible string can be obtained by passing this value to Scanner:protocol_name().
packets: Number of packets scanned which belong to the traffic flow.
last_seen: Last time (in seconds from the Epoch) at which a packet belonging to the flow has been scanned.

— Method Scanner:flows

Returns an iterator over all the traffic flows detected by the scanner. The returned value is suitable to be used in a for-loop:

for flow in my_scanner:flows() do
   -- Do something with "flow".
end

— Method Scanner:protocol_name protocol

Given a protocol identifier, returns a user-friendly name as a string. Typically the protocol is obtained from flow objects returned by Scanner:get_flow().

NdpiScanner (apps.wall.scanner.ndpi)

NdpiScanner uses the nDPI library (via the ljndpi FFI binding) to scan packets and determine L7 traffic flows. The nDPI library (libndpi.so) must be available in the host system. Versions 1.7 and 1.8 are supported.

— Method NdpiScanner:new ticks_per_second

Creates a new scanner, with a ticks_per_second resolution.

Utilities

The apps.wall.util module contains miscellaneous utilities.

— Function util.ipv4_addr_cmp a, b

Compares two IPv4 addresses a and b. The returned value follows the same convention as for C.memcmp(): zero if both addresses are equal, or an integer value with the same sign as the sign of the difference between the first pair of bytes that differ in a and b.

— Function util.ipv6_addr_cmp a, b

Compares two IPv6 addresses a and b. The returned value follows the same convention as for C.memcmp(): zero if both addresses are equal, or an integer value with the same sign as the sign of the difference between the first pair of bytes that differ in a and b.
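
The memcmp() convention can be illustrated in plain Lua on ordinary tables of byte values. The helper bytes_cmp is hypothetical and does not use the FFI types the real functions presumably operate on; it only demonstrates the sign convention of the result.

```lua
-- Compare the first `n` byte values of `a` and `b`, memcmp() style.
local function bytes_cmp (a, b, n)
   for i = 1, n do
      if a[i] ~= b[i] then return a[i] - b[i] end
   end
   return 0
end

local x = { 10, 0, 0, 1 }   -- 10.0.0.1
local y = { 10, 0, 1, 0 }   -- 10.0.1.0
```

Such a comparison gives a total order on addresses, which is one way to decide which endpoint of a flow is the "lo" and which is the "hi" address.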

SouthAndNorth (apps.wall.util)

The SouthAndNorth application is not meant to be used directly, but rather as a building block for more complex applications which need two duplex ports (south and north) that forward packets between them, optionally doing some intermediate processing.

Packets arriving on the north port are passed to the :on_southbound_packet() method (which can be overridden in a subclass) and then forwarded to the south port. Conversely, packets arriving on the south port are passed to the :on_northbound_packet() method and then forwarded to the north port.

SouthAndNorth

The value returned by :on_southbound_packet() and :on_northbound_packet() determines what will be done to the packet being processed:

Returning false discards the packet: the packet will not be forwarded, and packet.free() will be called on it.
Returning a different packet replaces the packet: the packet originally being processed is discarded, packet.free() called on it, and the returned packet is forwarded.
Returning the same packet being handled will forward it. Returning nil achieves the same effect.

Example

The following snippet defines an application derived from SouthAndNorth which silently discards packets bigger than a certain size, and keeps a count of how many packets have been discarded and forwarded:


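-- Setting SouthAndNorth as metatable "inherits" from it.
DiscardBigPackets = setmetatable({},
   require("apps.wall.util").SouthAndNorth)
-- Instances use DiscardBigPackets as their metatable (see new() below),
-- so it needs its own __index for method lookups to reach it.
DiscardBigPackets.__index = DiscardBigPackets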

function DiscardBigPackets:new (max_length)
   return setmetatable({
      max_packet_length = max_length,
      discarded_packets = 0,
      forwarded_packets = 0,
   }, self)
end

function DiscardBigPackets:on_northbound_packet (pkt)
   if pkt.length > self.max_packet_length then
      self.discarded_packets = self.discarded_packets + 1
      return false
   end
   self.forwarded_packets = self.forwarded_packets + 1
end

-- Apply the same logic for packets in the other direction.
DiscardBigPackets.on_southbound_packet =
   DiscardBigPackets.on_northbound_packet

RSS app (apps.rss.rss)

The rss app implements the basic functionality needed to provide generic receive side scaling to other apps. In essence, the rss app takes packets from an arbitrary number n of input links and distributes them to an arbitrary number m of output links.

rss

The distribution algorithm has the property that all packets belonging to the same flow are guaranteed to be mapped to the same output link, where a flow is identified by the value of certain fields of the packet header, depending on the type of packet.

For IPv4 and IPv6, the basic classifier is given by the 3-tuple (source address, destination address, protocol), where protocol is the value of the protocol field of the IPv4 header or the value of the next-header field that identifies the “upper-layer protocol” of the IPv6 header (which may be preceded by any number of extension headers).

If the protocol is either TCP (protocol #6), UDP (protocol #17) or SCTP (protocol #132), the list of header fields is augmented by the port numbers to yield the 5-tuple (source address, destination address, protocol, source port, destination port).

The output link is determined by applying a hash function to the set of header fields:

out_link = ( hash(flow_fields) % m ) + 1

All other packets are not classified into flows and are always mapped to the first output link.
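
The mapping can be tried out with a plain-Lua sketch. The hash below is a toy function chosen only for illustration; it is not the hash used by the rss app, and representing the flow fields as a list of strings is likewise an assumption.

```lua
-- Toy hash over a list of strings (illustration only).
local function toy_hash (fields)
   local h = 0
   for _, v in ipairs(fields) do
      for i = 1, #v do
         h = (h * 31 + v:byte(i)) % 4294967296
      end
   end
   return h
end

-- Select one of `m` output links; `fields` is nil for packets that are
-- not classified into flows.
local function out_link (fields, m)
   if fields == nil then return 1 end
   return (toy_hash(fields) % m) + 1
end

-- 5-tuple of a TCP flow: addresses, protocol, source and destination port.
local flow = { "2001:db8::1", "2001:db8::2", "6", "49152", "443" }
local link = out_link(flow, 4)
```

Because the result depends only on the flow fields and m, every packet of the same flow lands on the same output link.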

The actual scaling property is achieved by running the receivers in separate processes and using specialized inter-process links to connect them to the rss app.

In addition to this basic functionality, the rss app also implements the following set of extensions.

Flow-director

The output links can be grouped into equivalence classes with respect to matching conditions in terms of arbitrary pflang expressions as provided by the pf module. Matching packets are only distributed to the output links that belong to the equivalence class. By default, a single equivalence class exists which matches all packets. It is special in the sense that the matching condition cannot be expressed in pflang. This default class is the only one that can receive non-IP packets.

Classes are specified in an explicit order when an instance of the rss app is created. The default class is created implicitly as the last element in the list. Each packet is matched against the filter expressions, starting with the first one. If a match is found, the packet is assigned to the corresponding equivalence class and processing of the list stops.

The default class can be disabled by configuration. In that case, packets not assigned to any class are dropped.

Packet replication

The standard flow-director assigns a packet to at most one class. Any class can also be marked with the attribute continue to allow matches to multiple classes. When a packet is matched to such a class, it is distributed to the set of output links associated with that class but processing of the remaining filter expressions continues. If the packet matches a subsequent class, a copy is created and distributed to the corresponding set of output links. Processing stops when the packet matches a class that does not have the continue attribute.
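
The matching order and the effect of continue can be sketched in plain Lua. This is a simplified model under assumptions: classify is a hypothetical function, ordinary Lua predicates stand in for compiled pflang filters, and the restriction that only the default class receives non-IP packets is not modeled.

```lua
-- Return the names of the classes that receive `pkt` (or a copy of it),
-- in order. `classes` is the configured list without the default class.
local function classify (classes, pkt, default_enabled)
   local matched = {}
   for _, class in ipairs(classes) do
      if class.match(pkt) then
         matched[#matched + 1] = class.name
         -- Without `continue`, the first match ends processing.
         if not class.continue then return matched end
      end
   end
   -- The implicit default class comes last and matches everything.
   if default_enabled then matched[#matched + 1] = "default" end
   return matched
end

local classes = {
   { name = "mirror", continue = true,
     match = function (p) return p.proto == "udp" end },
   { name = "dns",
     match = function (p) return p.dport == 53 end },
}
```

An empty result corresponds to a dropped packet, which can only happen when the default class is disabled.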

Weighted links

By default, all output links in a class are treated the same. In other words, if the input consists of a sufficiently large sample of random flows, all links will receive about the same share of them. It is possible to introduce a bias for certain links by assigning a weight to them, given by a positive integer w. If the number of links is m and the weight of link i (1 <= i <= m) is w_i, the share of traffic received by it is given by

share_i = w_i/(w_1 + w_2 + ... + w_m)

For example, if m = 2 and w_1 = 1, w_2 = 2, link #1 will get 1/3 and link #2 will get 2/3 of the traffic.
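
The share formula, and one possible way to realize it, can be sketched in plain Lua. Expanding each link into as many slots as its weight is only an assumed technique for illustration; the text above does not say how the rss app implements weights.

```lua
-- Expected share of traffic per link for a list of weights.
local function shares (weights)
   local total = 0
   for _, w in ipairs(weights) do total = total + w end
   local result = {}
   for i, w in ipairs(weights) do result[i] = w / total end
   return result
end

-- Assumed realization: link i occupies w_i slots, and a flow hash
-- selects a slot uniformly.
local function expand (weights)
   local slots = {}
   for link, w in ipairs(weights) do
      for _ = 1, w do slots[#slots + 1] = link end
   end
   return slots
end

local function weighted_link (slots, hash)
   return slots[(hash % #slots) + 1]
end

local s = shares({ 1, 2 })
local slots = expand({ 1, 2 })
```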

Packet meta-data

In order to compute the hash over the header fields, the rss app must parse the packets to a certain extent. Internally, the result of this analysis is appended as a block of data to the end of the actual packet data. Because this data can be useful to other apps downstream of the rss app, it is exposed as part of the API.

The meta-data is structured as follows

   struct {
      uint16_t magic;
      uint16_t ethertype;
      uint16_t vlan;
      uint16_t total_length;
      uint8_t *filter_start;
      uint16_t filter_length;
      uint8_t *l3;
      uint8_t *l4;
      uint16_t filter_offset;
      uint16_t l3_offset;
      uint16_t l4_offset;
      uint8_t  proto;
      uint8_t  frag_offset;
      int16_t  length_delta;
   }

magic

This field contains the constant 0x5abb to mark the start of a valid meta-data block. The get API function asserts that this value is correct.

ethertype

This is the Ethertype contained in the Ethernet header of the packet. If the frame is of type 802.1q, i.e. the Ethertype is 0x8100, the ethertype field is set to the effective Ethertype following the 802.1q header. Only one level of tagging is recognised, i.e. for double-tagged frames, ethertype will contain the value 0x8100.

vlan

If the frame contains an 802.1q tag, vlan is set to the value of the VID field of the 802.1q header. Otherwise it is set to 0.

total_length

If ethertype identifies the frame as either a IPv4 or IPv6 packet (i.e. the values 0x0800 and 0x86dd, respectively), total_length is the size of the L3 payload of the Ethernet frame according to the L3 header, including the L3 header itself. For IPv4, this is the value of the header’s Total Length field. For IPv6, it is the sum of the header’s Payload Length field and the size of the basic header (40 bytes).

For all other values of ethertype, total_length is set to the effective size of the packet (according to the length field of the packet data structure) minus the size of the Ethernet header (14 bytes for untagged frames and 18 bytes for 802.1q tagged frames).

filter_start

This is a pointer into the packet that can be passed as first argument to a BPF matching function generated by pf.compile_filter.

For untagged frames, this is a pointer to the proper Ethernet header.

For 802.1q tagged frames, an offset of 4 bytes is added to skip the 802.1q header. The reason for this is that the pf module does not implement the vlan primitive of the standard BPF syntax. The additional 4-byte offset places the effective Ethertype (i.e. the same value as in the ethertype meta-data field) at the position of an untagged Ethernet frame. Note that this makes the original MAC addresses unavailable to the filter.

filter_length

This value is the size of the chunk of data pointed to by filter_start and can be passed as second argument to a BPF matching function generated by pf.compile_filter. It is equal to the size of the packet if the frame is untagged or 4 bytes less than that if the frame is 802.1q tagged.

l3

This is a pointer to the start of the L3 header in the packet.

l4

This is a pointer to the start of the L4 header in the packet. For IPv4 and IPv6, it points to the first byte following the L3 header. For all other packets, it is equal to l3.

filter_offset, l3_offset, l4_offset

These values are the offsets of filter_start, l3, and l4 relative to the start of the packet. They are used by the copy API call to re-calculate the pointers after the meta-data block has been relocated.
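
As a rough illustration of how the offsets and lengths relate for the two frame types described above, the following sketch derives them from the packet length and a flag that says whether the frame carries an 802.1q tag. The function frame_offsets is hypothetical, and equating l3_offset with the size of the Ethernet header is an assumption based on the field descriptions, not taken from the rss app's code.

```lua
local ETHER_HEADER_SIZE = 14   -- untagged Ethernet header, in bytes
local DOT1Q_TAG_SIZE = 4       -- one 802.1q tag, in bytes

-- Derive the filter and L3 offsets for a frame of `packet_length` bytes.
local function frame_offsets(packet_length, tagged)
   local filter_offset = tagged and DOT1Q_TAG_SIZE or 0
   return {
      filter_offset = filter_offset,
      filter_length = packet_length - filter_offset,
      l3_offset = ETHER_HEADER_SIZE + filter_offset,
   }
end

local untagged = frame_offsets(100, false)
local tagged = frame_offsets(100, true)
print(untagged.filter_length, tagged.filter_length)   -- 100 and 96
```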

proto

For IPv4 and IPv6, the proto field contains the identifier of the upper layer protocol carried in the payload of the packet. For all other packets, its value is undefined.

For IPv4, the upper layer protocol is given by the value of the Protocol field of the header. For IPv6, it is the value of the Next Header field of the last extension header in the packet’s header chain. The rss app recognizes the following protocol identifiers as extension headers according to the IANA ipv6-parameters registry

0 IPv6 Hop-by-Hop Option
43 Routing Header for IPv6
44 Fragment Header for IPv6
51 Authentication Header
60 Destination Options for IPv6
135 Mobility Header
139 Host Identity Protocol
140 Shim6 Protocol

Note that the protocols 50 (Encapsulating Security Payload, ESP), 253 and 254 (reserved for experimentation and testing) are treated as upper layer protocols, even though, technically, they are classified as extension headers.
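
The walk over the header chain can be sketched in plain Lua as follows. This is a simplified model, not the rss app's code: the packet is a 1-based Lua array of byte values that starts at the IPv6 basic header, there are no bounds checks, and the function name upper_layer_proto is hypothetical. The header sizes follow the usual IPv6 conventions (8-byte units for most extension headers, a fixed 8 bytes for the Fragment header, 4-byte units for the Authentication Header).

```lua
-- Protocol identifiers treated as extension headers (see the list above).
local ext_headers = {
   [0] = true, [43] = true, [44] = true, [51] = true,
   [60] = true, [135] = true, [139] = true, [140] = true,
}

-- Returns the upper layer protocol, the 0-based offset of its header
-- relative to the start of the IPv6 header, and the fragment offset.
local function upper_layer_proto(bytes)
   local proto = bytes[7]     -- Next Header field of the basic header
   local off = 40             -- first byte after the basic header
   local frag_offset = 0
   while ext_headers[proto] do
      local next_proto, len_field = bytes[off + 1], bytes[off + 2]
      local size
      if proto == 44 then
         size = 8
         -- The upper 13 bits of the 16-bit field hold the offset.
         frag_offset = math.floor((bytes[off + 3] * 256 + bytes[off + 4]) / 8)
      elseif proto == 51 then
         size = (len_field + 2) * 4
      else
         size = (len_field + 1) * 8
      end
      proto, off = next_proto, off + size
   end
   return proto, off, frag_offset
end

-- Basic header, Hop-by-Hop header, Fragment header, then UDP (17).
local pkt = {}
for i = 1, 56 do pkt[i] = 0 end
pkt[7] = 0                      -- basic header: Next Header = Hop-by-Hop
pkt[41] = 44                    -- Hop-by-Hop: Next Header = Fragment
pkt[49] = 17                    -- Fragment: Next Header = UDP
pkt[51], pkt[52] = 0x05, 0xc8   -- fragment offset 185, flags clear
local proto, l4_off, frag = upper_layer_proto(pkt)
print(proto, l4_off, frag)      -- 17, 56 and 185
```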

frag_offset

For fragmented IPv4 and IPv6 packets, the frag_offset field contains the offset of the fragment in the original packet’s payload in 8-byte units. A value of zero indicates that the packet is either not fragmented at all or is the initial fragment.

For non-IP packets, the value is undefined.

length_delta

This field contains the difference of the packet’s effective length (as given by the length field of the packet data structure) and the size of the packet calculated from the IP header, i.e. the sum of l3_offset and total_length. For a regular packet, this difference is zero.

A negative value indicates that the packet has been truncated. A typical scenario where this is expected to occur is a setup involving a port-mirror that truncates packets either due to explicit configuration or due to a hardware limitation. The length_delta field can be used by a downstream app to determine whether it has received a complete packet.

A positive value indicates that the packet contains additional data which is not part of the protocol data unit. This is not expected to occur under normal circumstances. However, it has been observed that some devices perform this kind of padding when port-mirroring is configured with packet truncation and the mirrored packet is smaller than the truncation limit.

For non-IP packets, length_delta is always zero.
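
The definition translates directly into a small calculation. The helper names below are hypothetical and the numbers are made-up examples; a downstream app would read length_delta from the meta-data block instead of recomputing it.

```lua
-- Difference between the effective packet length and the length implied
-- by the IP header (l3_offset + total_length).
local function length_delta(packet_length, l3_offset, total_length)
   return packet_length - (l3_offset + total_length)
end

local function describe(delta)
   if delta < 0 then
      return "truncated"
   elseif delta > 0 then
      return "padded"
   else
      return "complete"
   end
end

-- Untagged frame carrying a 1500-byte IPv4 packet, received in full.
print(describe(length_delta(1514, 14, 1500)))   -- complete
-- The same frame cut to 128 bytes by a port mirror.
print(describe(length_delta(128, 14, 1500)))    -- truncated
-- A 40-byte IPv4 packet in a frame with 10 bytes of trailing data.
print(describe(length_delta(64, 14, 40)))       -- padded
```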

IPv6 extension header elimination

The pf module does not implement the protochain primitive for IPv6. The only extension header it can deal with is the fragmentation header (protocol 44). As a consequence, packets containing arbitrary extension headers cannot be matched against filter expressions.

To overcome this limitation, the meta-data generator of the rss app removes all extension headers from a packet by default, leaving only the basic IPv6 header followed immediately by the upper layer protocol. The values of the Payload Length and Next Header fields of the basic IPv6 header as well as the packet length are adjusted accordingly.

VLAN pseudo-tagging

Since the rss app can accept packets from multiple sources, the information on which link the packet was received is not trivially available to receiving apps unless the packets contain a unique identifier of some sort, e.g. a particular VLAN tag. If such an identifier is not available, the rss app can be configured to attach a pseudo VLAN tag to packets arriving on a particular input link. It is called “pseudo tagging” because the VLAN is only added to the packet’s meta-data, not the packet itself. As a consequence, a receiving app only sees this kind of tag when it examines the meta-data provided by the rss app. Such a pseudo-tag also overrides any native VLAN tag that a packet might have.

The pseudo-tagging is enabled by following a convention for the naming of input links as described below.

If proper VLAN tagging is required, the vlan.vlan.Tagger app can be pushed between the packet source and the input link.

Configuration

The rss app accepts the following arguments.

— Key default_class

Optional. A boolean that specifies whether the default filter class should be enabled. The default is true. The name of the default class is default.

— Key classes

Optional. An ordered list of class specifications. Each specification must be a table with the following keys.

Key name

Required. The name of the class. It must be unique among all classes and it must match the Lua regular expression %w+.
Key filter

Required. A string containing a pflang filter expression.
Key continue

Optional. A boolean that specifies whether processing of classes should continue if a packet has matched the filter of this class. The default is false.

— Key remove_extension_headers

Optional. A boolean that specifies whether IPv6 extension headers should be removed from packets. The default is true.
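
A configuration using these keys might look like the sketch below. The class names and filter expressions are made up for illustration; only the key names and defaults come from the descriptions above.

```lua
local rss_config = {
   default_class = true,
   remove_extension_headers = true,
   classes = {
      -- Classes are tried in the order given.  A packet that matches
      -- "dns" is still offered to the following classes because
      -- `continue` is true.
      { name = "dns", filter = "udp port 53", continue = true },
      { name = "ip6", filter = "ip6" },
      { name = "ip4", filter = "ip" },
   },
}
```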

The classes configuration option specifies the set of classes known to an instance of the rss app. The assignment of links to classes is done implicitly by connecting other apps using the convention <class>_<instance> for the name of the links, where <class> is the name of the class to which the links should be assigned exactly as specified by the name parameter of the class definition. The <instance> specifier can be any string (adhering to the naming convention for links) that distinguishes the links within a class.

If the instance specifier is formatted as <instance>_<weight>, where <instance> is restricted to the pattern %w+ and <weight> must be a number, the link’s weight is set to the value <weight>. The default weight for a link is 1.

If the rss app detects an output link whose name does not match any of the configured classes, it issues a warning message and ignores the link. Classes to which no output links are assigned are ignored.

The names of the input links are arbitrary unless the VLAN pseudo-tagging feature should be used. In that case, the link must be named vlan<vlan-id>, where <vlan-id> must be a number between 1 and 4094 and will be placed in the vlan meta-data field of every packet received on the link (irrespective of whether the packet has a real VLAN ID or not).
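
The naming conventions can be expressed with Lua string patterns. The two helper functions below are hypothetical and only illustrate the conventions described above; the rss app's own parsing may differ in detail.

```lua
-- Split an output link name of the form <class>_<instance> or
-- <class>_<instance>_<weight>.  Returns class, instance and weight,
-- or nil if the name does not follow the convention.
local function parse_output_link(name)
   local class, instance = name:match("^(%w+)_(.+)$")
   if not class then return nil end
   local weight = 1   -- default weight
   local inst, w = instance:match("^(%w+)_(%d+)$")
   if inst then instance, weight = inst, tonumber(w) end
   return class, instance, weight
end

-- Extract the pseudo VLAN ID from an input link name, or return nil
-- if the name does not request pseudo-tagging.
local function parse_input_vlan(name)
   local digits = name:match("^vlan(%d+)$")
   local id = digits and tonumber(digits)
   if id and id >= 1 and id <= 4094 then return id end
   return nil
end

print(parse_output_link("dns_b_3"))   -- dns, b and 3
print(parse_input_vlan("vlan100"))    -- 100
```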

Meta-data API

The meta-data functionality is implemented by the module apps.rss.metadata, which provides the following API.

— Function add packet, remove_extension_headers, vlan

Analyzes packet and adds a meta-data block starting immediately after the packet data. If the boolean remove_extension_headers is true, IPv6 extension headers are removed from the packet. The optional vlan overrides the value of the vlan meta-data field extracted from the packet, irrespective of whether the packet actually has a tag or not.

An error is raised if there is not enough room for the meta-data block in the packet.

— Function get packet

Returns a pointer to the meta-data in packet. An error is raised if the meta-data block does not start with the magic number (0x5abb).

— Function copy packet

Creates a copy of packet including the meta-data block. Returns a pointer to the new packet.

Inter-process links (apps.interlink.*)

The “interlink” transmitter and receiver apps allow for efficient exchange of packets between Snabb processes within the same process group (see Multiprocess operation (core.worker)).

Transmitter

To make packets from an output port available to other processes, configure a transmitter app, and link the appropriate output port to its input port.

local Transmitter = require("apps.interlink.transmitter")

config.app(c, "interlink", Transmitter)
config.link(c, "myapp.output -> interlink.input")

Then, in the process that should receive the packets, configure a receiver app with the same name, and link its output port as suitable.

local Receiver = require("apps.interlink.receiver")

config.app(c, "interlink", Receiver)
config.link(c, "interlink.output -> otherapp.input")

Subsequently, packets transmitted to the transmitter’s input port will appear on the receiver’s output port.

Alternatively, a name can be supplied as a configuration argument to be used instead of the app’s name:

config.app(c, "mylink", Receiver, "interlink")
config.link(c, "mylink.output -> otherapp.input")

Configuration

The configured app names denote globally unique queues within the process group. Alternatively, the receiver and transmitter apps can be passed a string that names the shared queue to attach to.

Starting either the transmitter or receiver app attaches them to a shared packet queue visible to the process group under the name that was given to the app. When the queue identified by the name is unavailable, because it is already in use by a pair of processes within the group, configuration of the app network will block until the queue becomes available. Once the transmitter or receiver apps are stopped they detach from the queue.

Only two processes (one receiver and one transmitter) can be attached to an interlink queue at the same time, but during the lifetime of the queue (e.g., from when the first process attached to when the last process detaches) it can be shared by any number of receivers and transmitters. Meaning, either process attached to the queue can be restarted or replaced by another process without packet loss.

Libraries

IP checksum (lib.checksum)

The checksum module provides an optimized ones-complement checksum routine.

— Function ipsum pointer length initial

Return the ones-complement checksum for the given region of memory.

pointer is a pointer to an array of data to be checksummed and length is the size of that data in bytes. initial is an unsigned 16-bit number in host byte order which is used as the starting value of the accumulator. The result is the IP checksum over the data in host byte order.

The initial argument can be used to verify a checksum or to calculate the checksum in an incremental manner over chunks of memory. The synopsis to check whether the checksum over a block of data is equal to a given value is the following

if ipsum(pointer, length, value) == 0 then
  -- checksum correct
else
  -- checksum incorrect
end

To chain the calculation of checksums over multiple blocks of data together to obtain the overall checksum, one needs to pass the one’s complement of the checksum of one block as initial value to the call of ipsum() for the following block, e.g.

local sum1 = ipsum(data1, length1, 0)
local total_sum = ipsum(data2, length2, bit.bnot(sum1))

This function takes advantage of SIMD hardware when available.
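
To make the role of the initial argument concrete, here is a plain Lua reference model of a ones-complement checksum. It is not the optimized lib.checksum routine: the name ipsum_ref is hypothetical, it operates on a Lua array of byte values instead of a pointer, and it complements with 0xffff - x where the examples above use bit.bnot.

```lua
-- Ones-complement checksum over a 1-based array of byte values.
local function ipsum_ref(bytes, initial)
   local sum = initial or 0
   local n = #bytes
   -- Add up the data as 16-bit big-endian words.
   for i = 1, n - 1, 2 do
      sum = sum + bytes[i] * 256 + bytes[i + 1]
   end
   -- A trailing odd byte is padded with a zero byte.
   if n % 2 == 1 then sum = sum + bytes[n] * 256 end
   -- Fold the carries back into the low 16 bits.
   while sum > 0xffff do
      sum = (sum % 0x10000) + math.floor(sum / 0x10000)
   end
   return 0xffff - sum
end

-- IPv4 header with the checksum field (bytes 11 and 12) set to zero.
local header = {
   0x45, 0x00, 0x00, 0x73, 0x00, 0x00, 0x40, 0x00, 0x40, 0x11,
   0x00, 0x00, 0xc0, 0xa8, 0x00, 0x01, 0xc0, 0xa8, 0x00, 0xc7,
}
local checksum = ipsum_ref(header, 0)   -- 0xb861

-- Verification: seeding with the expected checksum yields zero.
local verified = ipsum_ref(header, checksum) == 0

-- Chaining: pass the complement of the first block's checksum as the
-- initial value for the second block.
local first, second = {}, {}
for i = 1, 10 do first[i], second[i] = header[i], header[i + 10] end
local sum1 = ipsum_ref(first, 0)
local total = ipsum_ref(second, 0xffff - sum1)   -- same as `checksum`
```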

Ctable (lib.ctable)

A ctable is a hash table whose keys and values are instances of FFI data types. In Lua parlance, an FFI value is a “cdata” value, hence the name “ctable”.

A ctable is parameterized for the specific types for its keys and values. This allows for the table to be stored in an efficient manner. Adding an entry to a ctable will copy the value into the table. Logically, the table “owns” the value. Lookup can either return a pointer to the value in the table, or copy the value into a user-supplied buffer, depending on what is most convenient for the user.

To create a ctable, first create a parameters table specifying the key and value types, along with any other options. Then call ctable.new on those parameters. For example:

local ctable = require('lib.ctable')
local ffi = require('ffi')
-- Expected number of entries (illustrative value).
local occupancy = 1000
local params = {
   key_type = ffi.typeof('uint32_t'),
   value_type = ffi.typeof('int32_t[6]'),
   max_occupancy_rate = 0.4,
   initial_size = math.ceil(occupancy / 0.4)
}
local ctab = ctable.new(params)

— Function ctable.new parameters

Create a new ctable. parameters is a table of key/value pairs. The following keys are required:

key_type: An FFI type (LuaJIT “ctype”) for keys in this table.
value_type: An FFI type (LuaJIT “ctype”) for values in this table.

Optional entries that may be present in the parameters table include:

hash_seed: A hash seed, as a 16-byte array. The hash value of a key is a function of the key and also of the hash seed. Using a hash function with a seed prevents some kinds of denial-of-service attacks against network functions that use ctables. The seed defaults to a fresh random byte string. The seed also changes whenever a table is resized.
initial_size: The initial size of the hash table, including free space. Defaults to 8 slots.
max_occupancy_rate: The maximum ratio of occupancy/size, where occupancy denotes the number of entries in the table, and size is the total table size including free entries. Trying to add an entry to a “full” table will cause the table to grow in size. Defaults to 0.9, for a 90% maximum occupancy ratio.

min_occupancy_rate: Minimum ratio of occupancy/size. Removing an entry from an “empty” table will shrink the table.
resize_callback: An optional function that is called after the table has been resized. The function is called with two arguments: the ctable object and the old size. By default, no callback is used.

— Function ctable.load stream parameters

Load a ctable that was previously saved out to a binary format. parameters are as for ctable.new. stream should be an object that has a :read_ptr(ctype) method, which returns a pointer to an embedded instance of ctype in the stream, advancing the stream over the object; and :read_array(ctype, count) which is the same but reading count instances of ctype instead of just one.

Methods

Users interact with a ctable through methods. In these method descriptions, the object on the left-hand-side of the method invocation should be a ctable.

— Method :resize size

Resize the ctable to have size total entries, including empty space.

— Method :add key, value, updates_allowed

Add an entry to the ctable, returning the index of the added entry. key and value are FFI values for the key and the value, of course.

updates_allowed is an optional parameter. If not present or false, then the :add method will raise an error if the key is already present in the table. If updates_allowed is the string "required", then an error will be raised if key is not already in the table. Any other true value allows updates but does not require them. An update will replace the existing entry in the table.

Returns a pointer to the inserted entry. Any subsequent modification to the table may invalidate this pointer.

— Method :update key, value

Update the entry in a ctable with the key key to have the new value value. Throw an error if key is not present in the table.

— Method :lookup_ptr key

Look up key in the table, and if found return a pointer to the entry. Return nil if the value is not found.

An entry pointer has three fields: the hash value, which must not be modified; the key itself; and the value. Access them as usual in Lua:

local ptr = ctab:lookup_ptr(key)
if ptr then print(ptr.value) end

Note that pointers are only valid until the next modification of a table.

— Method :lookup_and_copy key, entry

Look up key in the table, and if found, copy that entry into entry and return true. Otherwise return false.

— Method :remove_ptr entry

Remove an entry from a ctable. entry should be a pointer that points into the table. Note that pointers are only valid until the next modification of a table.

— Method :remove key, missing_allowed

Remove an entry from a ctable, keyed by key.

Return true if we actually do find a value and remove it. Otherwise if no entry is found in the table and missing_allowed is true, then return false. Otherwise raise an error.

— Method :save stream

Save a ctable to a byte sink. stream should be an object that has a :write_ptr(ctype) method, which writes an instance of a struct type out to a stream, and :write_array(ctype, count) which is the same but writing count instances of ctype instead of just one.

— Method :selfcheck

Run an expensive internal diagnostic to verify that the table’s internal invariants are fulfilled.

— Method :dump

Print out the entries in a table. Can be expensive if the table is large.

— Method :iterate

Return an iterator for use by for in. For example:

for entry in ctab:iterate() do
   print(entry.key, entry.value)
end

Streaming interface

As an implementation detail, the table is stored as an open-addressed robin-hood hash table with linear probing. Ctables use the high-quality SipHash hash function to allow for good distribution of hash values. To find a value associated with a key, a ctable will first hash the key, map that hash value to an index into the table by scaling the hash to the table size, and then scan forward in the table until we find an entry whose hash value is greater than or equal to the hash in question. Each entry stores its hash value, and empty entries have a hash of 0xFFFFFFFF. If the entry’s hash matches and the entry’s key is equal to the one we are looking for, then we have our match. If the entry’s hash is greater than our hash, then we have a failure. Hash collisions are possible as well of course; in that case we continue scanning forward.

The distance travelled while scanning for the matching hash is known as the displacement. The table measures its maximum displacement, for a number of purposes, but you might be interested to know that a maximum displacement for a table with 2 million entries and a 40% load factor is around 8 or 9. Smaller tables will have smaller maximum displacements.
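
The lookup procedure described above can be modelled in plain Lua. This is a toy sketch, not the lib.ctable implementation: the multiplicative toy_hash stands in for SipHash and is only meant for small integer keys, the slot array is over-allocated instead of being bounded by the maximum displacement, and the table neither resizes nor detects duplicate keys. All function names are hypothetical.

```lua
local EMPTY = 0xFFFFFFFF   -- hash value that marks a free slot

local function new_table(size)
   local t = { size = size, entries = {} }
   -- Extra slots so that forward scans never run off the end, as long
   -- as no more than `size` entries are added.
   for i = 0, 2 * size do t.entries[i] = { hash = EMPTY } end
   return t
end

local function toy_hash(key)
   return (key * 2654435761) % 0x100000000
end

-- Scale the 32-bit hash onto the range [0, size).
local function slot_for(t, hash)
   return math.floor(hash * t.size / 0x100000000)
end

local function add(t, key, value)
   local hash = toy_hash(key)
   local e, i = t.entries, slot_for(t, hash)
   while e[i].hash < hash do i = i + 1 end       -- skip smaller hashes
   local j = i
   while e[j].hash ~= EMPTY do j = j + 1 end     -- next free slot
   -- Shift the occupied run forward by one slot to make room.
   for k = j, i + 1, -1 do e[k] = e[k - 1] end
   e[i] = { hash = hash, key = key, value = value }
end

local function lookup(t, key)
   local hash = toy_hash(key)
   local e, i = t.entries, slot_for(t, hash)
   while e[i].hash < hash do i = i + 1 end       -- skip smaller hashes
   while e[i].hash == hash do                    -- collisions: compare keys
      if e[i].key == key then return e[i].value end
      i = i + 1
   end
   return nil                                    -- larger hash: not present
end

local t = new_table(8)
for key = 1, 6 do add(t, key, key * 10) end
print(lookup(t, 3), lookup(t, 99))   -- 30 and nil
```

Because entries stay sorted by hash value within each run, a lookup can stop as soon as it sees a larger hash, which is what keeps the displacement small.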

The ctable has two lookup interfaces. The first one is the lookup methods described above. The other interface will fetch all entries within the maximum displacement into a buffer, then do a branchless binary search over that buffer. This second streaming lookup can also fetch entries for multiple keys in one go. This can amortize the cost of a round-trip to RAM, in the case where you expect to miss cache for every lookup.

To perform a streaming lookup, first prepare a LookupStreamer for the batch size that you need. You will have to experiment to find the batch size that works best for your table’s entry sizes; for reference, for 32-byte entries a 32-wide lookup seems to be optimum.

-- Stream in 32 lookups at once.
local stride = 32
local streamer = ctab:make_lookup_streamer(stride)

Wiring up streaming lookup in a packet-processing network is a bit of a chore currently, as you have to maintain separate queues of lookup keys and packets, assuming that each lookup maps to a packet. Let’s make a little helper:

local lookups = {
   queue = ffi.new("struct packet * [?]", stride),
   queue_len = 0,
   streamer = streamer
}

local function flush(lookups)
   if lookups.queue_len > 0 then
      -- Here is the magic!
      lookups.streamer:stream()
      for i = 0, lookups.queue_len - 1 do
         local pkt = lookups.queue[i]
         if lookups.streamer:is_found(i) then
            local val = lookups.streamer.entries[i].value
            --- Do something cool here!
         end
      end
      lookups.queue_len = 0
   end
end

local function enqueue(lookups, pkt, key)
   local n = lookups.queue_len
   lookups.streamer.entries[n].key = key
   lookups.queue[n] = pkt
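   n = n + 1
   -- Record the new length before flushing, so that flush() processes
   -- every queued entry, including the one just added.
   lookups.queue_len = n
   if n == stride then
      flush(lookups)
   end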
end

Then as you see packets, you enqueue them via enqueue, extracting out the key from the packet in some way and passing that value as the argument. When enqueue detects that the queue is full, it will flush it, performing the lookups in parallel and processing the results.

Poptrie (lib.poptrie)

An implementation of Poptrie. Includes high-level functions for building the Poptrie data structure, as well as a hand-written, optimized assembler lookup routine.

Example usage

local poptrie = require("lib.poptrie")
local ipv4 = require("lib.protocol.ipv4")
local pt = poptrie.new{direct_pointing=true}
-- Associate prefixes of a given length with values (uint16_t)
pt:add(ipv4:pton("192.168.0.0"), 16, 1)
pt:add(ipv4:pton("192.0.0.0"), 8, 2)
pt:build()
pt:lookup32(ipv4:pton("192.1.2.3")) -- ⇒ 2
pt:lookup32(ipv4:pton("192.168.2.3")) -- ⇒ 1
-- The value zero denotes "no match"
pt:lookup32(ipv4:pton("193.1.2.3")) -- ⇒ 0
-- You can create a pre-built poptrie from its backing memory.
local pt2 = poptrie.new{
   nodes = pt.nodes,
   leaves = pt.leaves,
   directmap = pt.directmap
}

Performance

Note that performance tends to be memory-bound. The results below reflect ideal conditions with hot caches. See Benchmarking Poptrie.

Intel(R) Xeon(R) CPU E3-1246 v3 @ 3.50GHz (Haswell, Turbo off)

PMU analysis (numentries=10000, numhit=100, keysize=32)
build: 0.1857 seconds
lookup: 8460.17 cycles/lookup 18089.70 instructions/lookup
lookup32: 62.71 cycles/lookup 99.99 instructions/lookup
lookup64: 64.11 cycles/lookup 100.00 instructions/lookup
lookup128: 74.44 cycles/lookup 118.66 instructions/lookup
build(direct_pointing): 0.1676 seconds
lookup(direct_pointing): 1306.68 cycles/lookup 3146.96 instructions/lookup
lookup32(direct_pointing): 35.49 cycles/lookup 62.61 instructions/lookup
lookup64(direct_pointing): 35.95 cycles/lookup 62.61 instructions/lookup
lookup128(direct_pointing): 37.75 cycles/lookup 66.81 instructions/lookup

Interface

— Function new init

Creates and returns a new Poptrie object.

Init is a table with the following keys:

direct_pointing - Optional. Boolean that governs whether to use the direct pointing optimization. Default is false.
s - Optional. Bits to use for the direct pointing optimization. Default is 18. Note that the direct map array will be 2×2ˢ bytes in size.
leaves - Optional. An array of leaves. When leaves is supplied nodes must be supplied as well.
nodes - Optional. An array of nodes. When nodes is supplied leaves must be supplied as well.
directmap - Optional. A direct map array. When directmap is supplied, nodes and leaves must be supplied as well and direct_pointing is implicit.

— Method Poptrie:add prefix length value

Associates value to prefix of length. Prefix must be a uint8_t * pointing to at least math.ceil(length/8) bytes. Length must be an integer equal to or greater than 1. Value must be a 16‑bit unsigned integer, and should be greater than zero (see lookup* as to why.)

— Method Poptrie:build

Compiles the optimized poptrie data structure used by lookup64. After calling this method, the leaves and nodes fields of the Poptrie object will contain the leaves and nodes arrays respectively. These arrays can be used to construct a Poptrie object.

— Method Poptrie:lookup32 key

— Method Poptrie:lookup64 key

— Method Poptrie:lookup128 key

Looks up key in the Poptrie object and returns the associated value or zero. Key must be a uint8_t * pointing to at least 4/8/16 bytes respectively.

Unless the Poptrie object was initialized with leaves and nodes arrays, the user must call Poptrie:build before calling Poptrie:lookup64.

It is an error to call these lookup routines on poptries that contain prefixes longer than supported by the individual lookup routine. I.e., you can only call lookup64 on poptries with prefixes of less than or equal to 64 bits.
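
For readers unfamiliar with longest-prefix matching, the expected results of the example usage can be reproduced with a naive linear search in plain Lua. This sketch only models the lookup semantics (the longest matching prefix wins, zero means no match). It works on dotted-quad strings and says nothing about Poptrie's data layout, bit order or performance; all names in it are hypothetical.

```lua
-- Convert a dotted-quad IPv4 address to a number.
local function to_u32(addr)
   local a, b, c, d = addr:match("^(%d+)%.(%d+)%.(%d+)%.(%d+)$")
   return ((tonumber(a) * 256 + tonumber(b)) * 256 + tonumber(c)) * 256
      + tonumber(d)
end

local function lpm_new()
   return { prefixes = {} }
end

local function lpm_add(t, addr, length, value)
   table.insert(t.prefixes,
                { key = to_u32(addr), length = length, value = value })
end

-- Return the value of the longest matching prefix, or 0 if none matches.
local function lpm_lookup(t, addr)
   local key = to_u32(addr)
   local best_length, best_value = -1, 0
   for _, p in ipairs(t.prefixes) do
      -- Compare only the leading p.length bits.
      local shift = 2 ^ (32 - p.length)
      if math.floor(key / shift) == math.floor(p.key / shift)
         and p.length > best_length then
         best_length, best_value = p.length, p.value
      end
   end
   return best_value
end

local t = lpm_new()
lpm_add(t, "192.168.0.0", 16, 1)
lpm_add(t, "192.0.0.0", 8, 2)
print(lpm_lookup(t, "192.1.2.3"))     -- 2
print(lpm_lookup(t, "192.168.2.3"))   -- 1
print(lpm_lookup(t, "193.1.2.3"))     -- 0
```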

PMU (lib.pmu)

The CPU’s PMU (Performance Monitoring Unit) collects information about specific performance events such as cache misses, branch mispredictions, and utilization of internal CPU resources like execution units. This module provides an API for counting events with the PMU.

Hundreds of low-level counters are available. The exact list depends on CPU model. See pmu_cpu.lua for our definitions.

High-level interface

— Function is_available

If the PMU hardware is available then return true. Otherwise return two values: false and a string briefly explaining why. (Cooperation from the Linux kernel is required to access the PMU.)

— Function profile function [event_list] [aux]

Call function, return the result, and print a human-readable report of the performance events that were counted during execution.

— Function measure function [event_list]

Call function and return two values: the result and a table of performance event counter tallies.
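
A cautious way to use the high-level interface is to fall back to a plain call when the PMU cannot be used. The wrapper below is a sketch under that assumption: run_profiled is a hypothetical helper, the module is loaded with pcall so that the code also runs outside of Snabb, and the event pattern is the one shown in the low-level interface.

```lua
-- Run `f` under the PMU profiler when it is available, otherwise run
-- it without profiling.  Returns the result of `f` in both cases.
local function run_profiled(f, events)
   local loaded, pmu = pcall(require, "lib.pmu")
   if loaded then
      local available, why = pmu.is_available()
      if available then
         return pmu.profile(f, events)
      end
      print("PMU not available: " .. tostring(why))
   end
   return f()
end

local result = run_profiled(function ()
   local sum = 0
   for i = 1, 1e6 do sum = sum + i end
   return sum
end, {"mem_load.*l._hit"})
```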

Low-level interface

— Function setup event_list

Set up the hardware performance counters to track a given list of events (in addition to the built-in fixed-function counters).

Each event is a Lua string pattern. This could be a full event name:

mem_load_uops_retired.l1_hit

or a more general pattern that matches several counters:

mem_load.*l._hit

Return the number of overflowed counters that could not be tracked due to hardware constraints. These will be the last counters in the list.

Example:

setup({"uops_issued.any",
       "uops_retired.all",
       "br_inst_retired.conditional",
       "br_misp_retired.all_branches"}) => 0

— Function new_counter_set

Return a counter_set object that can be used for accumulating events. The counter_set will be valid only until the next call to setup().

— Function switch_to counter_set

Switch to a new set of counters to accumulate events in. Has the side-effect of committing the current accumulators to the previous record.

If counter_set is nil then do not accumulate events.

— Function to_table counter_set

Return a table containing the values accumulated in counter_set.

Example:

to_table(cs) =>
  {
   -- Fixed-function counters
   instructions                 = 133973703,
   cycles                       = 663011188,
   ref-cycles                   = 664029720,
   -- General purpose counters selected with setup()
   uops_issued.any              = 106860997,
   uops_retired.all             = 106844204,
   br_inst_retired.conditional  =  26702830,
   br_misp_retired.all_branches =       419
  }

— Function report counter_set [aux]

Print a textual report on the values accumulated in a counter set. Optionally include auxiliary application-level counters. The ratio of each event to each auxiliary counter is also reported.

Example:

report(my_counter_set, {packet = 26700000, breath = 208593})

prints output approximately like:

EVENT                                   TOTAL     /packet     /breath
instructions                      133,973,703       5.000     642.000
cycles                            663,011,188      24.000    3178.000
ref-cycles                        664,029,720      24.000    3183.000
uops_issued.any                   106,860,997       4.000     512.000
uops_retired.all                  106,844,204       4.000     512.000
br_inst_retired.conditional        26,702,830       1.000     128.000
br_misp_retired.all_branches              419       0.000       0.000
packet                             26,700,000       1.000     128.000
breath                                208,593       0.008       1.000

Snabb program configuration with YANG (lib.yang)

YANG is a data modelling language designed for use in networking equipment, standardized as RFC 6020. The lib.yang modules provide YANG facilities to Snabb applications, allowing operators to understand how to work with a Snabb data plane and also providing convenient configuration facilities for data-plane authors.

Overview

Everything in YANG starts with a schema: a specification of the data model of a device. For example, consider a simple Snabb router that receives IPv4 traffic and sends it out one of 12 ports. We might model it like this:

module snabb-simple-router {
  namespace snabb:simple-router;
  prefix simple-router;

  import ietf-inet-types {prefix inet;}

  leaf active { type boolean; default true; }

  container routes {
    list route {
      key addr;
      leaf addr { type inet:ipv4-address; mandatory true; }
      leaf port { type uint8 { range 0..11; } mandatory true; }
    }
  }
}

Given this schema, lib.yang can automatically derive a configuration file format for this Snabb program and create a parser that applies the validation constraints from the schema. The result is a simple plain-old-data Lua object that the data-plane can use directly.

Additionally there is support for efficient binary compilation of configurations. The problem is that even in this simple router, the routing table can grow quite large. While particular applications can sometimes incrementally update their configurations without completely reloading the configuration from the start, in general reloading is almost always a possibility, and you want to avoid packet loss during the time that the millions of routing table entries are loaded and validated.

For that reason the lib.yang code also defines a mapping that, given a YANG schema, can compile any configuration for that schema into a pre-validated binary file that the data-plane can just load up directly. Additionally for list nodes that map between keys and values, the lib.yang facilities can compile that map into an efficient ctable, letting the data-plane use the configuration as-is.

The schema given above can be loaded from a string using load_schema from the lib.yang.schema module, from a file via load_schema_file, or by name using load_schema_by_name. This last interface allows one to compile a YANG schema into the Snabb binary directly; if we name the above file snabb-simple-router.yang and place it in the src/lib/yang directory, then load_schema_by_name('snabb-simple-router') will find it appropriately. Indeed, this is how the ietf-inet-types import in the above example was resolved.

Configuration syntax

Consider again the example snabb-simple-router schema. To configure a router, we need to provide a configuration in a way that the application can understand. In Snabb, we derive this configuration syntax from the schema, in the following way:

A module’s configuration is composed of the configurations of all data nodes (container, leaf-list, list, and leaf) inside it.
A leaf’s configuration is like keyword value;, where the keyword is the name of the leaf, and the value is in the right syntax for the leaf’s type. (More on value types below.)
A container’s configuration is the container’s keyword followed by the configuration of its data node children, like keyword { configuration... }.
A leaf-list’s configuration is a sequence of 0 or more instances of keyword value;, as in leaf.
A list’s configuration is a sequence of 0 or more instances of the form keyword { configuration... }, again where keyword is the list name and configuration... indicates the configuration of child data nodes.

Concretely, for the example schema above, the above algorithm derives a configuration format of the following form:

(active true|false;)?
(routes {
  (route { addr ipv4-address; port uint8; })*
})?

In this grammar syntax, (foo)? indicates either 0 or 1 instances of foo, (foo)* is similar but indicates 0 or more instances, and | expresses alternation.

An example configuration might be:

active true;
routes {
  route { addr 1.2.3.4; port 1; }
  route { addr 2.3.4.5; port 10; }
  route { addr 3.4.5.6; port 2; }
}

Except in special cases as described in RFC 6020, order is insignificant. You could have active false; at the end, for example, and route { addr 1.2.3.4; port 1; } is the same as route { port 1; addr 1.2.3.4; }.
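
The derivation rules above can be illustrated with a small serializer. This is only a sketch in plain Lua and is not part of lib.yang: the function name serialize_router_config and the input layout (an active boolean plus an array of routes with a string addr and a numeric port) are assumptions made for the example.

```lua
-- Hypothetical helper, not part of lib.yang: emit configuration text for
-- the snabb-simple-router grammar from a plain Lua table.
local function serialize_router_config(conf)
   local lines = {}
   -- Leaf: "keyword value;"
   if conf.active ~= nil then
      lines[#lines + 1] = ("active %s;"):format(tostring(conf.active))
   end
   -- Container: "keyword { children... }"
   if conf.routes then
      lines[#lines + 1] = "routes {"
      -- List: zero or more "keyword { children... }" instances.
      for _, route in ipairs(conf.routes) do
         lines[#lines + 1] =
            ("  route { addr %s; port %d; }"):format(route.addr, route.port)
      end
      lines[#lines + 1] = "}"
   end
   return table.concat(lines, "\n")
end

print(serialize_router_config({
   active = true,
   routes = {
      { addr = "1.2.3.4", port = 1 },
      { addr = "2.3.4.5", port = 10 },
   },
}))
```

Because both top-level nodes are optional in the grammar, an empty table serializes to an empty configuration.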

The surface syntax of our configuration format is the same as for YANG schemas; "1.2.3.4" is the same as 1.2.3.4. Snabb follows the XML mapping guidelines of how to represent data described by a YANG schema, except that it uses YANG syntax instead of XML syntax. We could generate XML instead, but we want to avoid bringing in the complexities of XML parsing to Snabb. We also think that the result is a syntax that is pleasant and approachable to write by hand; we want to make sure that everyone can use the same configuration format, regardless of whether they are configuring Snabb via an external daemon like sysrepo or whether they write configuration files by hand.

Compiled configurations

Loading a schema and using it to parse a data file can be a bit expensive, especially if the data file includes a large routing table or other big structure. It can be useful to pay for this parsing and validation cost “offline”, without interrupting a running data plane.

For this reason, Snabb supports compiling configurations to binary data. A data plane can load a compiled configuration without any validation, very cheaply. Users can explicitly call the compile_config_for_schema or compile_config_for_schema_by_name functions. Support is also planned for automatic compilation of source configuration files, so that the user can just edit configurations as text and still take advantage of the speedy binary configuration loads when nothing has changed.

Querying and updating configurations

[TODO] We will need to be able to serialize a configuration back to source, for when a user asks what the configuration of a device is. We will also need to serialize partial configurations, for when the user asks for just a part of the configuration.

[TODO] We will need to support updating the configuration of a running snabb application. We plan to compile the candidate configuration in a non-worker process, then signal the worker to reload its configuration.

[TODO] We will need to support incremental configuration updates, for example to add or remove a binding table entry for the lwAFTR. In this way we can avoid a full reload of the configuration, minimizing packet loss.

State data

[TODO] We need to map the state data exported by a Snabb process (counters, etc) to YANG-format data. Perhaps this can be done in a similar way as configuration compilation: the configuration facility in the Snabb binary compiles a YANG state data file and periodically updates it by sampling the data plane, and then we re-use the configuration serialization facilities to serialize (potentially partial) state data.

API reference

The public entry point to the YANG library is the lib.yang.yang module, which exports the following bindings. Note that unless you have special needs, probably the only one you want to use is load_configuration.

— Function load_configuration filename parameters

Load a configuration from disk. If filename is a compiled configuration, load it directly. Otherwise it must be a source file. In that case, try to load a corresponding compiled file instead if possible. If all that fails, actually parse the source configuration, and try to residualize a corresponding compiled file so that we won’t have to go through the whole thing next time.

parameters is a table of key/value pairs. The following key is required:

schema_name: The name of the YANG schema that describes the configuration. This is the name that appears as the id in module id { ... } in the schema.

Optional entries that may be present in the parameters table include:

verbose: Set to true to print verbose information about which files are being loaded and compiled.
revision_date: If set, assert that the loaded configuration was built against this particular schema revision date.

For more information on the format of the returned value, see the documentation below for load_config_for_schema.

— Function load_schema src filename

Load a YANG schema from the string src. filename is an optional file name for use in error messages. Returns a YANG schema object.

Schema objects do have useful internal structure but they are not part of the documented interface.

— Function load_schema_file filename

Load a YANG schema from the file named filename. Returns a YANG schema object.

— Function load_schema_by_name name revision

Load the given named YANG schema. The name indicates the canonical name of the schema, which appears as module name { ... } in the YANG schema itself, or as import name { ... } in other YANG modules that import this module. revision optionally indicates that a certain revision date should be required.

— Function add_schema src filename

Add the YANG schema from the string src to Snabb’s database of YANG schemas, making it available to load_schema_by_name and related functionality. filename is used when signalling any parse errors. Returns the name of the newly added schema.

— Function add_schema_file filename

Like add_schema, but reads the YANG schema in from a file. Returns the name of the newly added schema.

— Function load_config_for_schema schema src filename

Given the schema object schema, load the configuration from the string src. Returns a parsed configuration as a plain old Lua value that tries to represent configuration values using appropriate Lua types.

The top-level result from parsing will be a table whose keys are the top-level configuration options. For example in the above example:

active true;
routes {
  route { addr 1.2.3.4; port 1; }
  route { addr 2.3.4.5; port 10; }
  route { addr 3.4.5.6; port 2; }
}

In this case, the result would be a table with two keys, active and routes. The value of the active key would be Lua boolean true.

The routes container is just another table of the same kind.

Inside the routes container is the route list, which is represented as an associative array. The particular representation for the associative array depends on characteristics of the list type; see below for details. In this case the route list compiles to a ctable. Therefore to get the port for address 1.2.3.4, you would do:

local yang = require('lib.yang.yang')
local ipv4 = require('lib.protocol.ipv4')
local data = yang.load_config_for_schema(router_schema, conf_str)
local port = data.routes.route:lookup_ptr(ipv4:pton('1.2.3.4')).value.port
assert(port == 1)

Here we see that integer values like the port leaves are represented directly as Lua numbers, if they fit within the uint32 or int32 range. Integers outside that range are represented as uint64_t if they are positive, or int64_t otherwise.

Boolean values are represented using normal Lua booleans, of course.

String values are just parsed to Lua strings, with the normal Lua limitation that UTF-8 data is not decoded. Lua strings look like strings but really they are byte arrays.

There is special support for the ipv4-address, ipv4-prefix, ipv6-address, and ipv6-prefix types from ietf-inet-types, and mac-address from ietf-yang-types. Values of these types are instead parsed to raw binary data that is compatible with the relevant parts of Snabb’s lib.protocol facility.

Let us return to the representation of compound configurations, like list instances. A compound configuration whose shape is fixed is compiled to raw FFI data. A configuration’s shape is determined by its schema. A schema node whose data will be fixed is either a leaf whose type is numeric or boolean and which is either mandatory or has a default value, or a container (leaf-list, container, or list) whose elements are all themselves fixed.

In practice this means that a fixed container will be compiled to an FFI struct type. This is mostly transparent from the user perspective, as in LuaJIT you access struct members by name in the same way as for normal Lua tables.

A fixed leaf-list will be compiled to an FFI array of its element type, but on the Lua side is given the normal 1-based indexing and support for the #len length operator via a wrapper. A non-fixed leaf-list is just a Lua array (a table with indexes starting from 1).

Instances of list nodes can have one of several representations. (Recall that in YANG, list is not a list in the sense that we normally think of it in programming languages, but rather is a kind of hash map.)

If there is only one key leaf, and that leaf has a string type, then a configuration list is represented as a normal Lua table whose keys are the key strings, and whose values are Lua structures holding the leaf values, as in containers. (In fact, it could be that the value of a string-key struct is represented as a C struct, as in raw containers.)

If all key and value types are fixed, then a list configuration compiles to an efficient ctable.

If all keys are fixed but values are not, then a list configuration compiles to a cltable.

Otherwise, a list configuration compiles to a Lua table whose keys are Lua tables containing the keys. This sounds good on the surface but really it’s a pain, because you can’t simply look up a value in the table like foo[{key1=42,key2=50}]: lookup in such a table is by identity and not by value. Oh well. You can still do for k,v in pairs(foo), which is often good enough in this case.
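
The identity problem can be reproduced in plain Lua. The sketch below uses an ordinary table in place of a real list configuration; lookup_by_value is a hypothetical helper written for this example, not a lib.yang function.

```lua
-- A list keyed by Lua tables: each key is a distinct table object.
local key = { key1 = 42, key2 = 50 }
local foo = { [key] = "entry for 42/50" }

-- A fresh table with equal contents is a different object, so this misses.
print(foo[{ key1 = 42, key2 = 50 }])
-- Only the original key object finds the entry.
print(foo[key])

-- Hypothetical helper: linear search that compares key fields by value.
local function lookup_by_value(list, wanted)
   for k, v in pairs(list) do
      local match = true
      for name, value in pairs(wanted) do
         if k[name] ~= value then match = false; break end
      end
      if match then return v end
   end
   return nil
end

print(lookup_by_value(foo, { key1 = 42, key2 = 50 }))
```

The search is linear in the number of entries, so it only suits small lists or infrequent lookups.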

Note that there are a number of value types that are not implemented, including some important ones like union.

— Function load_config_for_schema_by_name schema_name src filename

Like load_config_for_schema, but identifying the schema by name instead of by value, as in load_schema_by_name.

— Function print_config_for_schema schema data file

Serialize the configuration data as text via repeated calls to the write method of file. At the end, the flush method is called on file. schema is the schema that describes data.

— Function compile_config_for_schema schema data filename mtime

Compile data, using a compiler generated for schema, and write out the result to the file named filename. mtime, if given, should be a table with secs and nsecs keys indicating the modification time of the source file. This information will be serialized in the compiled file, and may be used when loading the file to determine whether the configuration is up to date.

— Function compile_config_for_schema_by_name schema_name data filename mtime

Like compile_config_for_schema, but identifying the schema by name instead of by value, as in load_schema_by_name.

— Function load_compiled_data_file filename

Load the compiled data file at filename. If the file is not a compiled YANG configuration, an error will be signalled. The return value will be a table containing four keys:

schema_name: The name of the schema for which this file was compiled.
revision_date: The revision date of the schema for which this file was compiled, or the empty string ('') if unknown.
source_mtime: An mtime table, as for compile_config_for_schema. If no mtime was written into the file, both secs and nsecs will be zero.
data: The configuration data, in the same format as returned by load_config_for_schema.
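
One possible use of source_mtime is to decide whether a compiled file still corresponds to its source. The following is a minimal sketch in plain Lua, not the check that lib.yang performs: the function name compiled_is_current is hypothetical, and treating an all-zero mtime as "unknown, recompile" is an assumption of this example.

```lua
-- Hypothetical helper: decide whether a compiled file still matches its
-- source, given the source_mtime stored in the compiled file and the
-- source file's current mtime. Both are tables with secs and nsecs keys.
local function compiled_is_current(stored, current)
   -- Assumption: an all-zero stored mtime means "unknown", so recompile.
   if stored.secs == 0 and stored.nsecs == 0 then return false end
   return stored.secs == current.secs and stored.nsecs == current.nsecs
end

print(compiled_is_current({ secs = 10, nsecs = 5 }, { secs = 10, nsecs = 5 }))
```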

Hardware

PCI (lib.hardware.pci)

The lib.hardware.pci module provides functions that abstract common operations on PCI devices on Linux. In order to drive a PCI device using Direct memory access (DMA) one must:

1. Open the PCI device using pci.open_pci_resource_locked or pci.open_pci_resource_unlocked.
2. Unbind the PCI device using pci.unbind_device_from_linux.
3. Enable PCI bus mastering for device using pci.set_bus_master in order to enable DMA.
4. Memory map PCI device configuration space using pci.map_pci_memory.
5. Control the PCI device by manipulating the memory referenced by the pointer returned by pci.map_pci_memory.
6. Disable PCI bus master for device using pci.set_bus_master.
7. Unmap PCI device configuration space using pci.close_pci_resource.

The correct ordering of these steps is absolutely critical.

Users of lib.hardware.pci can rely on steps 6/7 being performed automatically in the event of a disorderly shutdown. However, to ensure that bus mastering for the PCI device in use is not disabled due to another worker’s shutdown (see core.worker) they must keep a flock(2) on resource 0. This can be achieved either implicitly via pci.open_pci_resource_locked or by manual calls to flock(2).
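
The seven steps above can be sketched as a single function. This is an illustration of the ordering only: with_pci_device and the body callback are hypothetical names introduced here, and the pci argument is expected to be the lib.hardware.pci module (or a stand-in with the same functions). It also assumes that resource 0 is the one to open, as in the locking note above.

```lua
-- Sketch of the seven steps. `pci` is the lib.hardware.pci module (or a
-- stand-in with the same functions); `body` is a hypothetical callback that
-- drives the device through the mapped pointer.
local function with_pci_device(pci, pciaddress, body)
   local fd = pci.open_pci_resource_locked(pciaddress, 0)   -- step 1
   pci.unbind_device_from_linux(pciaddress)                  -- step 2
   pci.set_bus_master(pciaddress, true)                      -- step 3
   local base = pci.map_pci_memory(fd)                       -- step 4
   local ok, err = pcall(body, base)                         -- step 5
   pci.set_bus_master(pciaddress, false)                     -- step 6
   pci.close_pci_resource(fd, base)                          -- step 7
   -- Re-raise any error from the callback after cleanup has run.
   if not ok then error(err, 0) end
end
```

In a Snabb program one would pass require("lib.hardware.pci") as the first argument. Running the callback under pcall means steps 6 and 7 still happen if the callback fails.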

— Variable pci.devices

An array of supported hardware devices. Must be populated by calling pci.scan_devices. Each entry is a table as returned by pci.device_info.

— Function pci.canonical pciaddress

Returns the canonical representation of a PCI address. The canonical representation is preferred internally in Snabb and for presenting to users. It shortens addresses with leading zeros like this: 0000:01:00.0 becomes 01:00.0.

— Function pci.qualified pciaddress

Returns the fully qualified representation of a PCI address. Fully qualified addresses have the form 0000:01:00.0 and so this function undoes any abbreviation in the canonical representation.
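
The relationship between the two representations can be sketched in plain Lua. These are hypothetical re-implementations for illustration, not the code in lib.hardware.pci, and they only handle the two address forms mentioned above.

```lua
-- Hypothetical re-implementations for illustration; not Snabb's own code.
local function canonical(address)
   -- Drop a leading all-zero PCI domain: "0000:01:00.0" -> "01:00.0".
   return (address:gsub("^0000:", ""))
end

local function qualified(address)
   -- Restore the domain if only bus:device.function was given.
   if address:match("^%x%x:%x%x%.%x$") then
      return "0000:" .. address
   end
   return address
end

print(canonical("0000:01:00.0"))  --> 01:00.0
print(qualified("01:00.0"))       --> 0000:01:00.0
```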

— Function pci.scan_devices

Scans for available PCI devices and populates the pci.devices table.

— Function pci.device_info pciaddress

Returns a table containing information about the PCI device by pciaddress. The table has the following keys:

pciaddress—String denoting the PCI address of the device. E.g. "0000:83:00.1".
vendor—Identification string e.g. "0x8086" for Intel.
device—Identification string e.g. "0x10fb" for 82599 chip.
interface—Name of Linux interface using this device e.g. "eth0".
status—String denoting the Linux operational status, or nil if not known.
driver—String denoting the Lua module that supports this hardware e.g. "apps.intel.intel10g".
usable—String denoting if the device was suitable to use when scanned. One of "yes" or "no".

— Function pci.which_driver vendor, model

Returns the module name for a suitable device driver (if available) for a device of model from vendor.

— Function pci.unbind_device_from_linux pciaddress

Forces Linux to unbind the device identified by pciaddress from any kernel drivers.

— Function pci.set_bus_master pciaddress, enable

Enables or disables PCI bus mastering for device identified by pciaddress depending on whether enable is a true or a false value. PCI bus mastering must be enabled in order to perform DMA on the PCI device.

— Function pci.open_pci_resource_unlocked pciaddress, n

— Function pci.open_pci_resource_locked pciaddress, n

Opens configuration space n of PCI device identified by pciaddress. Returns a file descriptor of the opened sysfs resource file.

The two variants indicate if the underlying memory mapped file should be exclusively flocked or not.

— Function pci.map_pci_memory fd

Memory maps configuration space of PCI device identified by fd. Returns a pointer to the memory mapped region. The device must be unbound from Linux and PCI bus mastering must be enabled on the device before calling this function.

— Function pci.close_pci_resource file_descriptor, pointer

Closes memory mapped file_descriptor of sysfs resource file and unmaps it from pointer as returned by pci.map_pci_memory.

Register (lib.hardware.register)

The lib.hardware.register module provides an abstraction for hardware device registers. This abstraction can be used to declaratively specify and conveniently manipulate structured memory regions via DMA. The functions register.define and register.define_array construct Register objects based on a register description string. The resulting Register objects can be used to manipulate the defined registers using the methods Register:read, Register:write, Register:set, Register:clr, Register:wait and Register:reset (exact set depends on the register mode).

A register description is a string with one Register object definition per line. A Register object definition must be expressed using the following grammar:

Register   ::= Name Offset Indexing Mode Longname
Name       ::= <identifier>
Indexing   ::= "-"
           ::= "+" OffsetStep "*" Min ".." Max
Mode       ::= "RO" | "RW" | "RC" | "RCR" | "RW64" | "RO64" | "RC64" | "RCR64"
Longname   ::= <string>
Offset ::= OffsetStep ::= Min ::= Max ::= <number>

A Register object definition is made up of the following properties:

Name—A string to be used to refer to the Register object. Must be a valid Lua identifier, e.g. "foo", "foo_bar", "FOO" etc.
Offset—Integer specifying the offset from the base pointer (as supplied to register.define and register.define_array).
Indexing—Optional. Three integers specifying the offset step in bytes as well as the minimum and maximum indexes.
Mode—One of "RO", "RW", "RC", "RCR", "RO64", "RW64", "RC64", "RCR64", standing for read-only, read-write and counter modes in their 32 bit and 64 bit variants. "RC" is for counter registers that clear back to zero when read; "RCR" is for counters that wrap.
Longname—A string describing the register (used for self-documentation).

For instance, the following Register object definition defines a register range “TXDCTL” in read-write mode starting at offset 0x06028 with 128 registers each of length 0x40.

TXDCTL 0x06028 +0x40*0..127 RW Transmit Descriptor Control

The next example defines a singular register “TPT” in counter mode located at offset 0x01428.

TPT 0x01428 - RC Total Packets Transmitted
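
To make the grammar concrete, here is a minimal sketch of a parser for a single definition line, in plain Lua. It is not the parser used by lib.hardware.register; the names parse_register and register_offset are hypothetical, and computing an indexed register's offset as offset + step * index is an assumption based on the TXDCTL example above.

```lua
-- Hypothetical parser for one line of a register description. It follows
-- the grammar above but is not the parser used by lib.hardware.register.
local function parse_register(line)
   local name, offset, indexing, mode, longname =
      line:match("^(%S+)%s+(%S+)%s+(%S+)%s+(%S+)%s+(.-)%s*$")
   if not name then return nil, "malformed definition: " .. line end
   local reg = { name = name, offset = tonumber(offset),
                 mode = mode, longname = longname }
   if indexing ~= "-" then
      -- Indexing has the form "+" OffsetStep "*" Min ".." Max
      local step, min, max = indexing:match("^%+(%w+)%*(%d+)%.%.(%d+)$")
      if not step then return nil, "malformed indexing: " .. indexing end
      reg.step, reg.min, reg.max = tonumber(step), tonumber(min), tonumber(max)
   end
   return reg
end

-- Byte offset of register number `index` within an indexed range.
local function register_offset(reg, index)
   return reg.offset + (reg.step or 0) * (index or 0)
end

local txdctl = parse_register(
   "TXDCTL 0x06028 +0x40*0..127 RW Transmit Descriptor Control")
print(txdctl.name, txdctl.mode, txdctl.min, txdctl.max)
print(("0x%05x"):format(register_offset(txdctl, 1)))
```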

— Function register.define description, table, base_pointer, n

Creates Register objects for description relative to base_pointer. The resulting Register objects will become named entries in table using the names defined in description. If an entry in description defines an indexing range then n specifies the index of the register within that range. n defaults to 0.

— Function register.define_array description, table, base_pointer

Creates Register objects for description relative to base_pointer. The resulting Register objects will become named entries in table using the names defined in description. If an entry in description defines an indexing range, an array of Register objects will be created instead of a singular Register object.

— Function register.dump table

Prints a pretty-printed register dump of a table of registers.

— Method Register:read

Returns the value of register. For convenience register objects can be called without arguments instead of calling Register:read. E.g. reg:read() is equivalent to reg().

— Method Register:write value

Sets the value of register to value. Only available on registers in read-write mode. For convenience register objects can be called with an argument instead of calling Register:write. E.g. reg:write(value) is equivalent to reg(value).

When a register in counter mode is read, it is assumed that the hardware register resets to zero upon reading. The value read is added to a register accumulator, and the sum of all reads is returned.

— Method Register:set bitmask

Sets bits of register according to bitmask. Only available on registers in read-write mode.

— Method Register:clr bitmask

Clears bits of register according to bitmask. Only available on registers in read-write mode.

— Method Register:bits offset, length, bits

Get or set length bits at offset in register. Sets length bits at offset in register to bits if bits is supplied. Returns length bits at offset in register otherwise. Setting is only available on registers in read-write mode.
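
The arithmetic behind such a bit-field accessor can be sketched as follows. The helpers get_bits and set_bits are hypothetical and operate on plain numbers rather than on Register objects; they avoid a bit library so that they run on any Lua implementation.

```lua
-- Hypothetical helpers showing the arithmetic behind a bit-field accessor,
-- written without a bit library so they run on any Lua.
local function get_bits(value, offset, length)
   -- Shift right by `offset`, then keep the low `length` bits.
   return math.floor(value / 2^offset) % 2^length
end

local function set_bits(value, offset, length, bits)
   -- Clear the field, then add the new field value in its place.
   local cleared = value - get_bits(value, offset, length) * 2^offset
   return cleared + (bits % 2^length) * 2^offset
end

print(("0x%x"):format(get_bits(0xABCD, 4, 8)))        --> 0xbc
print(("0x%x"):format(set_bits(0xABCD, 4, 8, 0x12)))  --> 0xa12d
```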

— Method Register:byte offset, byte

Get or set byte at offset in register. Sets byte at offset in register to byte if byte is supplied. Returns byte at offset in register otherwise. Setting is only available on registers in read-write mode.

— Method Register:wait bitmask, value

Blocks until applying bitmask to the register equals value. If value is not supplied blocks until all bits in the mask are set instead. Only available on registers in read-write and read-only modes.

— Method Register:reset

Reset the register accumulator to 0. Only available on registers in counter mode.
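
The accumulator behaviour of counter-mode registers can be modelled in a few lines. This is a sketch under stated assumptions, not the Register implementation: new_counter is a hypothetical constructor, and read_hardware stands in for a device register that clears to zero each time it is read.

```lua
-- Hypothetical model of a counter-mode register: `read_hardware` stands in
-- for the device register, which clears to zero each time it is read.
local function new_counter(read_hardware)
   local accumulator = 0
   local counter = {}
   function counter.read()
      -- Add the value since the last read to the running total.
      accumulator = accumulator + read_hardware()
      return accumulator
   end
   function counter.reset()
      accumulator = 0
   end
   return counter
end

-- Simulated clear-on-read hardware register.
local pending = 5
local tpt = new_counter(function()
   local n = pending
   pending = 0
   return n
end)
print(tpt.read())  --> 5
pending = 3
print(tpt.read())  --> 8
```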

— Method Register:print

Prints the register state to standard output.

Protocols

Protocol Header (lib.protocol.header)

The lib.protocol.header module contains the base class from which the supported protocol classes are derived. It defines generic methods on all protocol subclasses.

— Method header:new_from_mem memory, length

Creates and returns a header object by “overlaying” the respective header structure over length bytes of memory. Returns nil if length is too small to contain the header.

— Method header:header

Returns the raw header as a cdata object.

— Method header:sizeof

Returns the byte size of header.

— Method header:eq header

Generic equality predicate. Returns true if header is equal to self and false otherwise.

— Method header:copy destination, relocate

Copies the header to destination. The caller must ensure that there is enough space at destination. If relocate is a true value, destination is promoted to be the active storage for the header.

— Method header:clone

Returns a copy of the header object.

— Method header:upper_layer

Returns the protocol class that can handle the “upper layer protocol” or nil if the protocol is not supported or the protocol has no upper layer.

For instance, on an Ethernet header object this method might return an IPv4 or IPv6 header class.

Ethernet (lib.protocol.ethernet)

The lib.protocol.ethernet module contains a class for representing Ethernet headers. The ethernet protocol class supports two upper layer protocols: lib.protocol.ipv4 and lib.protocol.ipv6.

— Method ethernet:new config

Returns a new Ethernet header for config. Config must be a table which may contain the following keys:

dst - Destination MAC (binary representation). Default is 00:00:00:00:00:00.
src - Source MAC (binary representation). Default is 00:00:00:00:00:00.
type - Either 0x0800 or 0x86dd for IPv4 or IPv6 respectively. Default is 0x0.

— Method ethernet:src mac

— Method ethernet:dst mac

— Method ethernet:type type

Combined accessor and setter methods. These methods set the values of the source, destination and type fields of an Ethernet header. If no argument is given the current value is returned.

Example:

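local ethernet = require("lib.protocol.ethernet")
-- Create a header with zeroed addresses and the IPv6 ethertype.
local eth = ethernet:new({src = ethernet:pton("00:00:00:00:00:00"),
                          dst = ethernet:pton("00:00:00:00:00:00"),
                          type = 0x86dd})
-- Set the destination address, then read it back in text form.
eth:dst(ethernet:pton("54:52:00:01:00:00"))
print(ethernet:ntop(eth:dst())) -- prints 54:52:00:01:00:00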

— Method ethernet:src_eq mac

— Method ethernet:dst_eq mac

Predicate methods to test if mac is equal to the source or destination address respectively.

— Method ethernet:swap

Swaps the values of the source and destination fields.

— Function ethernet:pton string

Returns the binary representation of MAC address denoted by string.

— Function ethernet:ntop mac

Returns the string representation of mac address.

— Function ethernet:is_mcast mac

Returns a true value if mac address denotes a Multicast address.

— Function ethernet:is_bcast mac

Returns a true value if mac address denotes a Broadcast address.
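
The two predicates correspond to simple tests on the address. The sketch below works on the textual form "xx:xx:xx:xx:xx:xx" purely for readability, which is an assumption of this example. The function names are hypothetical stand-ins and not the library implementations.

```lua
-- Hypothetical checks on a MAC address given as a "xx:xx:xx:xx:xx:xx" string.
local function is_mcast(mac)
   -- Multicast addresses have the least significant bit of the first
   -- octet set.
   local first = tonumber(mac:sub(1, 2), 16)
   return first % 2 == 1
end

local function is_bcast(mac)
   -- The broadcast address has every bit set.
   return mac:lower() == "ff:ff:ff:ff:ff:ff"
end

print(is_mcast("33:33:00:00:00:01"), is_bcast("33:33:00:00:00:01"))
print(is_mcast("54:52:00:01:00:00"), is_bcast("ff:ff:ff:ff:ff:ff"))
```

Note that the broadcast address also satisfies the multicast test, since its first octet is odd.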

— Function ethernet:ipv6_mcast ip

Returns the MAC address for IPv6 multicast ip as defined by RFC2464, section 7.
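
RFC 2464, section 7 maps an IPv6 multicast address to an Ethernet address made of the octets 33:33 followed by the last four octets of the IPv6 address. A minimal sketch of that mapping, using a hypothetical function ipv6_mcast_mac that takes the address as an array of 16 octet values and returns text (the library function works on binary representations instead):

```lua
-- Hypothetical sketch of the RFC 2464 section 7 mapping: the Ethernet
-- address is 33:33 followed by the last four octets of the IPv6 multicast
-- address. `ip` is an array of 16 octet values.
local function ipv6_mcast_mac(ip)
   return ("33:33:%02x:%02x:%02x:%02x"):format(ip[13], ip[14], ip[15], ip[16])
end

-- ff02::1:ff00:1234
local group = { 0xff, 0x02, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0x01,
                0xff, 0x00, 0x12, 0x34 }
print(ipv6_mcast_mac(group))  --> 33:33:ff:00:12:34
```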

IPv4 (lib.protocol.ipv4)

The lib.protocol.ipv4 module contains a class for representing IPv4 headers. The ipv4 protocol class supports four upper layer protocols: lib.protocol.tcp, lib.protocol.udp, lib.protocol.gre and lib.protocol.icmp.header.

— Method ipv4:new config

Returns a new IPv4 header for config. Config must be a table which may contain the following keys:

dst - Destination IPv4 address (binary representation). Default is 0.0.0.0.
src - Source IPv4 address (binary representation). Default is 0.0.0.0.
protocol - The upper layer protocol, can be 6 (TCP), 17 (UDP), 47 (GRE) or 58 (ICMP). Default is 255.
dscp - “Differentiated Services Code Point” field (6 bit unsigned integer). Default is 0.
ecn - “Explicit Congestion Notification” field (2 bit unsigned integer). Default is 0.
id - “Identification” field (16 bit unsigned integer). Default is 0.
flags - “Don’t Fragment (DF)” and “More Fragments (MF)” fields (3 bit unsigned integer). Default is 0.
frag_off - “Fragment Offset” field (13 bit unsigned integer). Default is 0.
ttl - “Time To Live” field (8 bit unsigned integer). Default is 0.

— Method ipv4:dst ip

— Method ipv4:src ip

— Method ipv4:protocol protocol

— Method ipv4:dscp dscp

— Method ipv4:ecn ecn

— Method ipv4:id id

— Method ipv4:flags flags

— Method ipv4:frag_off frag_off

— Method ipv4:ttl ttl

Combined accessor and setter methods. These methods set the values of the instance fields (see new) of an IPv4 header. If no argument is given the current value is returned.

— Method ipv4:version version

Combined accessor and setter method for the “Version” field (4 bit unsigned integer). Defaults to 4 (set automatically by new). Sets the “Version” field to version. If no argument is given the current value is returned.

— Method ipv4:ihl ihl

Combined accessor and setter method for the “Internet Header Length” field (4 bit unsigned integer). Set automatically by new. Sets the “Internet Header Length” field to ihl. If no argument is given the current value is returned.

— Method ipv4:total_length length

Combined accessor and setter method for the “Total Length” field (16 bit unsigned integer). Defaults to header length (set automatically by new). Sets the “Total Length” field to length. If no argument is given the current value is returned.

— Method ipv4:checksum

Computes and sets the IPv4 header checksum. It is called automatically by new, but must be called again after the header is changed.
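
The reason the checksum must be recomputed after any change is that it covers every octet of the header. The following stand-alone sketch computes the Internet checksum (RFC 1071) over an array of octet values. The function name ipv4_header_checksum and the array-based input are assumptions of this example; the library method works on the header object itself.

```lua
-- Hypothetical stand-alone version of the Internet checksum (RFC 1071) used
-- for the IPv4 header. `octets` is an array of byte values with the
-- checksum field (octets 11 and 12) set to zero.
local function ipv4_header_checksum(octets)
   local sum = 0
   -- Add up the header as big-endian 16 bit words.
   for i = 1, #octets, 2 do
      sum = sum + octets[i] * 256 + (octets[i + 1] or 0)
   end
   -- Fold carries back into the low 16 bits (one's complement addition).
   while sum > 0xffff do
      sum = (sum % 0x10000) + math.floor(sum / 0x10000)
   end
   return 0xffff - sum
end

local header = {
   0x45, 0x00, 0x00, 0x73, 0x00, 0x00, 0x40, 0x00,
   0x40, 0x11, 0x00, 0x00, 0xc0, 0xa8, 0x00, 0x01,
   0xc0, 0xa8, 0x00, 0xc7,
}
print(("0x%04x"):format(ipv4_header_checksum(header)))  --> 0xb861
```

Changing any field, for example the TTL in octet 9, yields a different checksum.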

— Method ipv4:dst_eq ip

— Method ipv4:src_eq ip

Predicate methods to test if ip is equal to the source or destination address respectively.

— Function ipv4:pton string

Returns the binary representation of IPv4 address denoted by string.

— Function ipv4:ntop ip

Returns the string representation of ip address.

IPv6 (lib.protocol.ipv6)

The lib.protocol.ipv6 module contains a class for representing IPv6 headers. The ipv6 protocol class supports four upper layer protocols: lib.protocol.tcp, lib.protocol.udp, lib.protocol.gre and lib.protocol.icmp.header.

— Method ipv6:new config

Returns a new IPv6 header for config. Config must be a table which may contain the following keys:

dst - Destination IPv6 address (binary representation). Default is 0::0.
src - Source IPv6 address (binary representation). Default is 0::0.
traffic_class - “Traffic Class” field (8 bit unsigned integer). Default is 0.
flow_label - “Flow Label” field (20 bit unsigned integer). Default is 0.
next_header - “Next Header” field (8 bit unsigned integer). Default is 0.
hop_limit - “Hop Limit” field (8 bit unsigned integer). Default is 0.

— Method ipv6:dst ip

— Method ipv6:src ip

— Method ipv6:traffic_class traffic_class

— Method ipv6:flow_label flow_label

— Method ipv6:next_header next_header

— Method ipv6:hop_limit hop_limit

Combined accessor and setter methods. These methods set the values of the instance fields (see new) of an IPv6 header. If no argument is given the current value is returned.

— Method ipv6:version version

Combined accessor and setter method for the version field (4 bit unsigned integer). Defaults to 6 (set automatically by new). Sets the “Version” field to version. If no argument is given the current value is returned.

— Method ipv6:dscp dscp

Combined accessor and setter method for the “Differentiated Services Code Point” field (6 bit unsigned integer). Default is 0. This is a sub-field of the “Traffic Class” field. Sets the “Differentiated Services Code Point” field to dscp. If no argument is given the current value is returned.

— Method ipv6:ecn ecn

Combined accessor and setter method for the “Explicit Congestion Notification” (2 bit unsigned integer). Default is 0. This is a sub-field of the “Traffic Class” field. Sets the “Explicit Congestion Notification” field to ecn. If no argument is given the current value is returned.
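
The relationship between the “Traffic Class” field and its two sub-fields is plain arithmetic: DSCP occupies the upper six bits and ECN the lower two. A short sketch with hypothetical helper names, operating on numbers rather than on header objects:

```lua
-- Hypothetical helpers: the 8 bit Traffic Class carries DSCP in its upper
-- six bits and ECN in its lower two bits.
local function split_traffic_class(traffic_class)
   return math.floor(traffic_class / 4), traffic_class % 4
end

local function join_traffic_class(dscp, ecn)
   return dscp * 4 + ecn
end

local dscp, ecn = split_traffic_class(0xb8)
print(dscp, ecn)
print(("0x%02x"):format(join_traffic_class(46, 1)))  --> 0xb9
```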

— Method ipv6:payload_length length

Combined accessor and setter method for the “Payload Length” field (16 bit unsigned integer). Default is 0. Sets the “Payload Length” field to length. If no argument is given the current value is returned.

— Method ipv6:dst_eq ip

— Method ipv6:src_eq ip

Predicate methods to test if ip is equal to the source or destination address respectively.

— Function ipv6:pton string

Returns the binary representation of IPv6 address denoted by string.

— Function ipv6:ntop ip

Returns the string representation of ip address.

— Function ipv6:solicited_node_mcast ip

Returns the solicited-node multicast address from the given unicast ip.
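
A solicited-node multicast address has the form ff02::1:ffXX:XXXX, where XX:XXXX are the low 24 bits of the unicast address (RFC 4291). A minimal sketch, assuming addresses are given as arrays of 16 octet values rather than the binary representation used by the library; the function below is a hypothetical stand-in:

```lua
-- Hypothetical sketch: the solicited-node multicast address is
-- ff02::1:ffXX:XXXX, where XX:XXXX are the low 24 bits of the unicast
-- address (RFC 4291). Addresses are arrays of 16 octet values.
local function solicited_node_mcast(ip)
   return { 0xff, 0x02, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0x01, 0xff,
            ip[14], ip[15], ip[16] }
end

-- 2001:db8::abcd:1234
local unicast = { 0x20, 0x01, 0x0d, 0xb8, 0, 0, 0, 0, 0, 0, 0, 0,
                  0xab, 0xcd, 0x12, 0x34 }
local mcast = solicited_node_mcast(unicast)
print(("%02x%02x ... %02x%02x:%02x%02x"):format(
   mcast[1], mcast[2], mcast[13], mcast[14], mcast[15], mcast[16]))
```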

TCP (lib.protocol.tcp)

The lib.protocol.tcp module contains a class for representing TCP headers.

— Method tcp:new config

Returns a new TCP header for config. Config must be a table which may contain the following keys:

src_port - “Source Port Number” field (16 bit unsigned integer). Default is 0.
dst_port - “Destination Port Number” field (16 bit unsigned integer). Default is 0.
seq_num - “Sequence Number” field (32 bit unsigned integer). Default is 0.
ack_num - “Acknowledgement Number” field (32 bit unsigned integer). Default is 0.
window_size - “Window Size” field (16 bit unsigned integer). Default is 0.
offset - “Data Offset” field (4 bit unsigned integer). Default is 0.
ns - “NS” flag (1 bit). Default is 0.
cwr - “CWR” flag (1 bit). Default is 0.
ece - “ECE” flag (1 bit). Default is 0.
urg - “URG” flag (1 bit). Default is 0.
ack - “ACK” flag (1 bit). Default is 0.
psh - “PSH” flag (1 bit). Default is 0.
rst - “RST” flag (1 bit). Default is 0.
syn - “SYN” flag (1 bit). Default is 0.
fin - “FIN” flag (1 bit). Default is 0.

— Method tcp:src_port port

— Method tcp:dst_port port

— Method tcp:seq_num seq_num

— Method tcp:ack_num ack_num

— Method tcp:window_size window_size

— Method tcp:offset offset

— Method tcp:ns ns

— Method tcp:cwr cwr

— Method tcp:ece ece

— Method tcp:urg urg

— Method tcp:ack ack

— Method tcp:psh psh

— Method tcp:rst rst

— Method tcp:syn syn

— Method tcp:fin fin

Combined accessor and setter methods. These methods set the values of the instance fields (see new) of a TCP header. If no argument is given the current value is returned.

— Method tcp:flags flags

Combined accessor and setter method for the TCP header flags (NS, CWR, ECE, URG, ACK, PSH, RST, SYN and FIN). Sets the header’s flags according to flags (9 bit unsigned integer). If no argument is given the current flags are returned.
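
A 9 bit flags value can be assembled from individual flag names as sketched below. The helper tcp_flags and the FLAG_BITS table are hypothetical; the bit positions follow the TCP header layout with FIN as the lowest bit, and it is an assumption of this example that the value passed to tcp:flags uses the same layout.

```lua
-- Hypothetical helper: build the 9 bit flags value from flag names. Bit
-- positions follow the TCP header layout, with FIN as the lowest bit.
local FLAG_BITS = {
   fin = 0x001, syn = 0x002, rst = 0x004, psh = 0x008, ack = 0x010,
   urg = 0x020, ece = 0x040, cwr = 0x080, ns = 0x100,
}

local function tcp_flags(names)
   local flags, seen = 0, {}
   for _, name in ipairs(names) do
      local value = assert(FLAG_BITS[name],
                           "unknown TCP flag: " .. tostring(name))
      -- Count each flag once, even if it is listed twice.
      if not seen[name] then
         flags = flags + value
         seen[name] = true
      end
   end
   return flags
end

print(("0x%03x"):format(tcp_flags({ "syn", "ack" })))  --> 0x012
```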

— Method tcp:checksum payload, length, ip

Computes and sets the “Checksum” field for length bytes of payload and optionally ip. If no argument is given the current value of the “Checksum” field is returned.

UDP (lib.protocol.udp)

The lib.protocol.udp module contains a class for representing UDP headers.

— Method udp:new config

Returns a new UDP header for config. Config must be a table which may contain the following keys:

src_port - “Source Port Number” field (16 bit unsigned integer). Default is 0.
dst_port - “Destination Port Number” field (16 bit unsigned integer). Default is 0.

— Method udp:src_port port

— Method udp:dst_port port

Combined accessor and setter methods for the source and destination port fields. Sets the source or destination port respectively. Returns the current port if called without arguments.

— Method udp:length length

Combined accessor and setter method for the “Length” field. Sets the “Length” field to length (a 16 bit unsigned integer). If no argument is given the current value of the “Length” field is returned. Default is 8 (the UDP header length).

— Method udp:checksum payload, length, ip

Computes and sets the “Checksum” field for length bytes of payload and optionally ip. If no argument is given the current value of the “Checksum” field is returned.

GRE (lib.protocol.gre)

The lib.protocol.gre module contains a class for representing GRE headers. The gre protocol class only supports the checksum and key extensions and the lib.protocol.ethernet upper layer protocol.

— Method gre:new config

Returns a new GRE header for config. Config must be a table which may contain the following keys:

protocol - Upper layer protocol. May be 0x6558 (Ethernet). Default is nil.
checksum - Set to true to enable checksumming. Default is false.
key - 32 bit unsigned integer. Enables keying if supplied. Default is nil.

— Method gre:checksum payload, length

Combined accessor and setter method for the checksum field. Computes and sets the checksum field for length bytes of payload. If no argument is given the current checksum is returned. Returns nil if checksumming is disabled.

— Method gre:checksum_check payload, length

Predicate to verify length bytes of payload against the header checksum. Returns nil if checksumming is disabled.

— Method gre:key key

Combined accessor and setter method for the key field. Sets the key field to key. If no argument is given the current key is returned. Returns nil if keying is disabled.

— Method gre:protocol protocol

Combined accessor and setter method for the upper layer protocol. Sets the upper layer protocol to protocol. If no argument is given the current upper layer protocol is returned.
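To make the effect of the checksum and key extensions concrete, the sketch below derives the flag bits and header size that a given config table implies, following the GRE header layout of RFC 2784 and RFC 2890. The gre_layout helper is hypothetical and not part of lib.protocol.gre.

```lua
-- Hypothetical helper, not part of lib.protocol.gre: derives the flag
-- bits and header size implied by a config table like that of gre:new.
local function gre_layout (config)
   local flags, size = 0, 4            -- base header: flags/version + protocol
   if config.checksum then
      flags = flags + 0x8000           -- C bit (RFC 2784)
      size = size + 4                  -- checksum + reserved field
   end
   if config.key ~= nil then
      flags = flags + 0x2000           -- K bit (RFC 2890)
      size = size + 4                  -- 32 bit key
   end
   return flags, size
end

local flags, size = gre_layout({ protocol = 0x6558, checksum = true, key = 0xdeadbeef })
print(string.format("flags=0x%04x size=%d", flags, size))
```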

ICMP (lib.protocol.icmp.header)

The lib.protocol.icmp.header module contains a class for representing ICMP headers. The icmp protocol class currently supports two upper layer protocols: lib.protocol.icmp.nd.ns and lib.protocol.icmp.nd.na. These upper layer protocols implement the headers necessary to perform “Neighbor Discovery”.

— Method icmp:new type, code

Returns a new ICMP header of type which may be either 135 or 136 for lib.protocol.icmp.nd.ns or lib.protocol.icmp.nd.na respectively. Optionally code can be supplied to set the “Code” field for the type.

— Method icmp:type type

— Method icmp:code code

Combined accessor and setter methods. These methods set the values of the instance fields (see new) of an ICMP header. If no argument is given the current value is returned.

— Method icmp:checksum payload, length, ipv6

Computes and sets the “Checksum” field for length bytes of payload. If the lower protocol layer is lib.protocol.ipv6 then ipv6 must be set to a true value.

— Method icmp:checksum_check payload, length, ipv6

Predicate to test if the header’s “Checksum” field matches length bytes of payload. If the lower protocol layer is lib.protocol.ipv6 then ipv6 must be set to a true value.

Neighbor Solicitation (lib.protocol.icmp.nd.ns)

— Method ns:new target

Returns a new Neighbor Solicitation header. Target is the IP address used for the “Target Address” field.

— Method ns:target target

Combined accessor and setter method for the “Target Address” field. Sets the “Target Address” field to target. If no argument is given the current value is returned.

— Method ns:target_eq target

Predicate to test if the header’s value in the “Target Address” field is equivalent to target.

Neighbor Advertisement (lib.protocol.icmp.nd.na)

— Method na:new target, router, solicited, override

Returns a new Neighbor Advertisement header. Target is the IP address used for the “Target Address” field. Router, solicited and override can be boolean values to set the “Router”, “Solicited” and “Override” flags respectively. The default for the flags is 0.

— Method na:target target

— Method na:router router

— Method na:solicited solicited

— Method na:override override

Combined accessor and setter methods. These methods set the values of the instance fields (see new) of a Neighbor Advertisement header. If no argument is given the current value is returned.

— Method na:target_eq target

Predicate to test if the header’s value in the “Target Address” field is equivalent to target.

Both Neighbor Solicitation and Advertisement (lib.protocol.icmp.nd.ns and lib.protocol.icmp.nd.na) headers implement an options method for parsing TLV Options contained in their payloads.

Example:

-- Parse datagram with ICMP/NA packet
local na = dgram:parse()
-- Parse TLV Options
local options = na:options(dgram:payload())

— Method nd:options payload, length

Parses and returns an array of TLV Options (see lib.protocol.icmp.nd.options.tlv) from length bytes of payload.

TLV Option (lib.protocol.icmp.nd.options.tlv)

The lib.protocol.icmp.nd.options.tlv module contains a class for representing TLV Options. Currently only two types of options are implemented: “Source Link-Layer Address” ("src_ll_addr") and “Target Link-Layer Address” ("tgt_ll_address"). Both are represented by the lladdr class (see lib.protocol.icmp.nd.options.lladdr).

— Method tlv:new type, data

Returns a new TLV Option object for data of type. Type may be either 1 for “Source Link-Layer Address” or 2 for “Target Link-Layer Address”. Data must be a lladdr object.

— Method tlv:name

Returns a string denoting the type of the option. Either "src_ll_addr" for “Source Link-Layer Address” or "tgt_ll_address" for “Target Link-Layer Address”.

— Method tlv:length

Returns the size of the TLV Option as multiples of 8 bytes.
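For example, a Link-Layer Address option for an Ethernet MAC address occupies one byte of type, one byte of length and six bytes of address, which is 8 bytes or one unit. The sketch below spells out that arithmetic as defined in RFC 4861; tlv_length_units is a hypothetical helper, not a method of the tlv class.

```lua
-- Hypothetical helper: length of an ND option in units of 8 bytes
-- (RFC 4861), given the size of the option data in bytes.
local function tlv_length_units (data_bytes)
   local total = 2 + data_bytes        -- 1 byte type + 1 byte length + data
   assert(total % 8 == 0, "option must be padded to a multiple of 8 bytes")
   return math.floor(total / 8)
end

-- A 6 byte Ethernet MAC address yields an option of one unit.
print(tlv_length_units(6))
```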

— Method tlv:type type

Combined accessor and setter method. Sets the type field (see new) to type. If no argument is given the current value of the type field is returned.

— Method tlv:option

Returns an object of the class denoted by the type field. Currently that only includes lladdr instances.

Link-Layer Address Option (lib.protocol.icmp.nd.options.lladdr)

The lib.protocol.icmp.nd.options.lladdr module contains a class for representing Link-Layer Address Options.

— Method lladdr:new address

Returns a new Link-Layer Address Option object for address, a MAC address in binary representation.

— Method lladdr:name

Returns the string "ll_addr".

— Method lladdr:addr address

Combined accessor and setter method. Sets the address field (see new) to address. If no argument is given the current value of the address field is returned.

Datagram (lib.protocol.datagram)

The lib.protocol.datagram module provides basic mechanisms for parsing, building and manipulating a hierarchy of protocol headers and the associated payload contained in a data packet. In particular, it supports:

Parsing and in-place manipulation of protocol headers in a received packet
In-place decapsulation by removing leading protocol headers
Adding headers to an existing packet
Creation of a new packet
Appending payload to a packet

It mediates between packets as defined in core.packet and protocol classes which are defined as classes derived from the protocol header base class in the lib.protocol.header module.

The contents of a datagram instance are logically divided into three areas: The payload, parsed headers and pushed headers. The datagram payload is a sequence of bytes either inherited from the packet given to datagram:new or appended using datagram:payload. The headers in the payload can be parsed using datagram:parse_match, which will shrink the payload by the header. Finally, synthetic headers can be prepended to the datagram using datagram:push. To get the whole datagram as a packet use datagram:packet.

Datagram

A datagram can be used in two modes of operation, called “immediate commit” and “delayed commit”. In immediate commit mode, the push and pop methods immediately modify the underlying packet. However, this can be undesirable.

Even though these manipulations are relatively fast, because SIMD instructions are used to move and copy data when possible, performance-aware applications usually try to avoid as many of them as possible. This creates a conflict if the caller pushes or parses a sequence of protocol headers in immediate commit mode, because each operation modifies the packet separately.

This problem can be avoided by using delayed commit mode. In this mode, the push methods add the data to a separate buffer as intermediate storage. The buffer is prepended to the actual packet in a single operation by calling datagram:commit.

The pop methods are made light-weight in delayed commit mode as well by keeping track of an additional offset that indicates where the actual packet starts in the packet buffer. Each call to one of the pop methods simply increases the offset by the size of the popped piece of data. The accumulated actions will be applied as a single operation by datagram:commit.

The push and pop methods can be freely mixed in delayed commit mode.

Due to the destructive nature of these methods in immediate commit mode, they cannot be applied when the parse stack is not empty, because moving the data in the packet buffer will invalidate the parsed headers. The push and pop methods will raise an error in that case.

The buffer used in delayed commit mode has a fixed size of 512 bytes. This limits the size of data that can be pushed in a single operation. A sequence of push/commit operations can be used to push an arbitrary amount of data in chunks of up to 512 bytes.
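The behaviour of delayed commit mode described above can be pictured with a toy model that uses Lua strings instead of packet buffers. Everything in this sketch (the Model class and its fields) is hypothetical, and it assumes that pops always apply to the original packet data while pushes are staged separately; lib.protocol.datagram itself operates on core.packet buffers.

```lua
-- Toy model of delayed commit mode using Lua strings. All names are
-- hypothetical; lib.protocol.datagram works on packet buffers instead.
local BUFFER_SIZE = 512

local Model = {}
Model.__index = Model

function Model.new (bytes)
   return setmetatable({ data = bytes, offset = 0, staged = "" }, Model)
end

-- push: stage bytes in front of any previously staged bytes.
function Model:push (bytes)
   assert(#self.staged + #bytes <= BUFFER_SIZE, "staging buffer full, commit first")
   self.staged = bytes .. self.staged
end

-- pop: only advance the offset at which the packet logically starts.
function Model:pop (n)
   assert(self.offset + n <= #self.data, "not enough data to pop")
   self.offset = self.offset + n
end

-- commit: apply all accumulated pushes and pops in a single step.
function Model:commit ()
   self.data = self.staged .. self.data:sub(self.offset + 1)
   self.staged, self.offset = "", 0
end

local d = Model.new("[eth][ipv6]payload")
d:pop(5)            -- logically drop "[eth]"
d:push("[gre]")     -- staged
d:push("[ipv4]")    -- staged in front of "[gre]"
assert(d.data == "[eth][ipv6]payload")   -- nothing has been applied yet
d:commit()
print(d.data)
```

In this model the data is rewritten once, at commit time, no matter how many push and pop calls came before it.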

— Method datagram:new packet, protocol, options

Creates a datagram for packet or from scratch if packet is nil. Protocol will be used by parse_match to parse the packet payload. If protocol is not nil it is set as the initial upper layer protocol. If options is not nil it must be a table that selects configurable properties of the class. Currently, the only option is the selection of immediate or delayed commit mode by setting the key delayed_commit to false or true, respectively. The default is immediate commit mode.

— Method datagram:push header

Prepends header to the front of the datagram. This method destructively modifies the underlying packet in immediate commit mode and raises an error if the parse stack is not empty.

In delayed commit mode, header is prepended to an intermediate buffer.

— Method datagram:push_raw data, length

This method behaves like the datagram:push method for an arbitrary chunk of memory of length length located at the address pointed to by data.

— Method datagram:parse_match protocol, check

Attempts to parse the next header in the datagram, thereby removing it from the payload. Returns a header instance of class protocol on success. If protocol is nil the current upper layer protocol as set by datagram:new or previous calls to parse_match is used.

If neither protocol nor the upper layer protocol is set or the constructor of the protocol class returns nil, the parsing operation has failed and parse_match returns nil. The datagram remains unchanged.

If the protocol class instance has been created successfully, it is passed as single argument to the anonymous function check.

If check returns a false value, the parsing has failed and parse_match returns nil. The packet remains unchanged.

If check is not supplied or if it returned a true value, the parsing has succeeded and the current upper layer protocol of the datagram is set to the value returned by header:upper_layer.

— Method datagram:parse protocols_and_checks

A wrapper around parse_match that allows parsing of a sequence of headers with a single method call.

If protocols_and_checks is a sequence of protocol class and check function pairs, parse_match is called for each pair. Returns the header object of the last header parsed or nil if any of the calls to parse_match return nil.

If called with a nil argument, this method is equivalent to parse_match called without arguments.

— Method datagram:parse_n n

A wrapper around parse_match that parses the next n protocol headers using the current upper layer protocol and subsequent values of header:upper_layer. It returns the last header object or nil if fewer than n headers could be parsed successfully.

— Method datagram:unparse n

Undoes the last n calls to parse_match on the datagram. That is, it prepends the last n parsed headers back to the payload. The sequence of parsed headers can be obtained by calling stack.

— Method datagram:pop n

Removes the leading n parsed headers from the datagram. Note that headers added via push cannot be removed using pop. The caller has to ensure that the datagram contains at least n headers that were parsed using parse_match. The sequence of parsed headers can be obtained by calling stack. This method destructively modifies the underlying packet in immediate commit mode and raises an error if the parse stack is not empty.

In delayed commit mode, the packet is not modified and the parse stack remains valid.

For instance, let d be a datagram with an Ethernet header followed by an IPv6 header. Assuming we have parsed both headers using d:parse_n(2), we could call d:pop(1) to decapsulate the IPv6 packet from its Ethernet header.

— Method datagram:pop_raw length, ulp

Removes length bytes from the beginning of the datagram. If ulp is given it is set as the current upper layer protocol. This method destructively modifies the underlying packet in immediate commit mode and raises an error if the parse stack is not empty.

In delayed commit mode, the packet is not modified and the parse stack remains valid.

— Method datagram:stack

Returns the parsed header objects as a sequence.

— Method datagram:packet

Returns a packet (see core.packet) containing the datagram (including pushed headers).

— Method datagram:payload pointer, length

Combined payload accessor and setter method. Returns a pointer to the datagram payload and its byte size.

If pointer and length are supplied then length bytes starting from pointer are appended to the datagram’s payload.

— Method datagram:data

Returns data and length of the underlying packet.

— Method datagram:commit

If called in delayed commit mode, the operations accumulated by the push and pop methods since the creation of the datagram or the last invocation of datagram:commit are committed to the underlying packet. An error is raised if the parse stack is not empty.

The method can be safely called in immediate commit mode.

IPsec

Encapsulating Security Payload (lib.ipsec.esp)

The lib.ipsec.esp module contains two classes encrypt and decrypt which implement packet encryption and decryption with IPsec ESP in both tunnel and transport modes. Currently, the only supported cipher is AES-GCM with 128‑bit keys, 4 bytes of salt, and a 16 byte authentication code. These classes do not implement any key exchange protocol.

Note: the classes in this module do not reject IP fragments of any sort.

References:

IPsec Wikipedia page.
RFC 4303 on IPsec ESP.
RFC 4106 on using AES-GCM with IPsec ESP.
LISP Data-Plane Confidentiality example of a software layer above these apps that includes key exchange.

— Method encrypt:new config

— Method decrypt:new config

Returns a new encryption/decryption context respectively. Config must be a table with the following keys:

aead - AEAD identifier (string). The only accepted value is "aes-gcm-16-icv" (AES-GCM with a 16 byte ICV).
spi - A 32 bit integer denoting the “Security Parameters Index” as specified in RFC 4303.
key - Hexadecimal string of 32 digits (two digits for each byte, most significant digit first) that denotes 16 bytes of high-entropy key material as specified in RFC 4106.
salt - Hexadecimal string of eight digits (two digits for each byte) that denotes four bytes of salt as specified in RFC 4106.
window_size - Optional. Minimum width of the window in which out of order packets are accepted as specified in RFC 4303. The default is 128. (decrypt only.)
resync_threshold - Optional. Number of consecutive packets allowed to fail decapsulation before attempting “Re-synchronization” as specified in RFC 4303. The default is 1024. (decrypt only.)
resync_attempts - Optional. Number of attempts to re-synchronize a packet that triggered “Re-synchronization” as specified in RFC 4303. The default is 8. (decrypt only.)
auditing - Optional. A boolean value indicating whether to enable or disable “Auditing” as specified in RFC 4303. The default is nil (no auditing). (decrypt only. Note: source address, destination address and flow ID are only logged when using decapsulate_transport6.)
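The constraints on aead, spi, key and salt listed above can be checked before a context is created. The sketch below is one way to do that; check_esp_config and is_hex are hypothetical helpers that are not part of lib.ipsec.esp, and the key and salt shown are placeholder values, not usable key material.

```lua
-- Hypothetical validator for the keys described above; it is not part
-- of lib.ipsec.esp. The key and salt values are placeholders only.
local function is_hex (s, digits)
   return type(s) == "string" and #s == digits and s:match("^%x+$") ~= nil
end

local function check_esp_config (conf)
   assert(conf.aead == "aes-gcm-16-icv", "unsupported aead")
   assert(type(conf.spi) == "number" and conf.spi >= 0 and conf.spi <= 0xffffffff,
          "spi must be a 32 bit integer")
   assert(is_hex(conf.key, 32), "key must be 32 hexadecimal digits")
   assert(is_hex(conf.salt, 8), "salt must be 8 hexadecimal digits")
   return conf
end

local conf = check_esp_config({
   aead = "aes-gcm-16-icv",
   spi = 0x100,
   key = "00112233445566778899AABBCCDDEEFF",
   salt = "00112233",
   window_size = 128   -- optional, decrypt only
})
print("config accepted, spi = " .. conf.spi)
```

A table that passes this check has the shape expected by encrypt:new and decrypt:new as documented here.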

Tunnel mode

In tunnel mode, encapsulation accepts packets of any format and wraps them in an ESP frame, encrypting the original packet contents. Decapsulation reverses the process: it removes the ESP frame and returns the original input packet.

ESP-Tunnel

— Method encrypt:encapsulate_tunnel packet, next_header

Encapsulates packet and encrypts its payload. The ESP header’s Next Header field is set to next_header. Takes ownership of packet and returns a new packet.

— Method decrypt:decapsulate_tunnel packet

Decapsulates packet and decrypts its payload. On success, takes ownership of packet and returns a new packet and the value of the ESP header’s Next Header field. Otherwise returns nil.

Transport mode

In transport mode, encapsulation accepts IPv6 packets and inserts a new ESP header between the outer IPv6 header and the inner protocol header (e.g. TCP, UDP, L2TPv3) and also encrypts the contents of the inner protocol header. Decapsulation does the reverse: it decrypts the inner protocol header and removes the ESP protocol header. In this mode it is expected that an Ethernet header precedes the outer IPv6 header.

ESP-Transport

— Method encrypt:encapsulate_transport6 packet

Encapsulates packet and encrypts its payload. On success, takes ownership of packet and returns a new packet. Otherwise returns nil.

— Method decrypt:decapsulate_transport6 packet

Decapsulates packet and decrypts its payload. On success, takes ownership of packet and returns a new packet. Otherwise returns nil.

Snabb NFV

NFV config (program.snabbnfv.nfvconfig)

The program.snabbnfv.nfvconfig module implements a Network Functions Virtualization component based on Snabb. It introduces a simple configuration file format to describe NFV configurations which it then compiles to app networks. This NFV component is compatible with OpenStack Neutron.

NFV

— Function nfvconfig.load file, pci_address, socket_path

Loads the NFV configuration from file and compiles an app network using pci_address and socket_path for the underlying NIC driver and VhostUser apps. Returns the resulting engine configuration.

NFV Configuration Format

The configuration file format understood by program.snabbnfv.nfvconfig is based on Lua expressions. Initially, it contains a list of NFV ports:

return { <port-1>, ..., <port-n> }

Each port is defined by a range of properties which correspond to the configuration parameters of the underlying apps (NIC driver, VhostUser, PcapFilter, RateLimiter, nd_light and SimpleKeyedTunnel):

port := { port_id        = <id>,          -- A unique string
          mac_address    = <mac-address>, -- MAC address as a string
          vlan           = <vlan-id>,     -- ..
          ingress_filter = <filter>,       -- A pcap-filter(7) expression
          egress_filter  = <filter>,       -- ..
          tunnel         = <tunnel-conf>,
          crypto         = <crypto-conf>,
          rx_police      = <n>,           -- Allowed input rate in Gbps
          tx_police      = <n> }          -- Allowed output rate in Gbps

The tunnel section deviates a little from SimpleKeyedTunnel’s terminology:

tunnel := { type          = "L2TPv3",     -- The only type (for now)
            local_cookie  = <cookie>,     -- As for SimpleKeyedTunnel
            remote_cookie = <cookie>,     -- ..
            next_hop      = <ip-address>, -- Gateway IP
            local_ip      = <ip-address>, -- ~ `local_address'
            remote_ip     = <ip-address>, -- ~ `remote_address'
            session       = <32bit-int> } -- ~ `session_id'

The crypto section allows configuration of traffic encryption based on apps.ipsec.esp:

crypto := { type          = "esp-aes-128-gcm", -- The only type (for now)
            spi           = <spi>,             -- As for apps.ipsec.esp
            transmit_key  = <key>,
            transmit_salt = <salt>,
            receive_key   = <key>,
            receive_salt  = <salt>,
            auditing      = <boolean> }
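As an illustration of the grammar above, a small configuration file with two ports might look like the following. All values are made up, and the example assumes that properties which are not needed (filters, tunnel, crypto, policing) can simply be omitted from a port.

```lua
-- Example NFV configuration file; all values are made up.
return {
   { port_id        = "A",
     mac_address    = "52:54:00:00:00:01",
     vlan           = 10,
     ingress_filter = "icmp6",   -- a pcap-filter(7) expression
     egress_filter  = "icmp6",
     rx_police      = 1,         -- Gbps
     tx_police      = 1 },       -- Gbps
   { port_id        = "B",
     mac_address    = "52:54:00:00:00:02",
     vlan           = 20 }
}
```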

snabbnfv traffic

The snabbnfv traffic program loads and runs an NFV configuration using program.snabbnfv.nfvconfig. It can be invoked like so:

./snabb snabbnfv traffic <file> <pci-address> <socket-path>

snabbnfv traffic runs the loaded configuration indefinitely and automatically reloads the configuration file if it changes (at most once every second).

snabbnfv neutron2snabb

The snabbnfv neutron2snabb program converts Neutron database CSV dumps to the format used by program.snabbnfv.nfvconfig. For more info see Snabb NFV Architecture. It can be invoked like so:

./snabb snabbnfv neutron2snabb <csv-directory> <output-directory> [<hostname>]

snabbnfv neutron2snabb reads the Neutron CSV dumps in csv-directory and translates them into one program.snabbnfv.nfvconfig configuration file per physical network. If hostname is given, it overrides the hostname provided by hostname(1).

LISPER

LISPER (program.lisper)

Snabb Switch program for overlaying Ethernet networks on the IPv6 Internet or a local IPv6 network. For transporting L2 networks over the Internet, LISPER requires the use of external LISP (RFC 6830) controllers.

Overview

LISPER transports L2 networks over an IPv6 network by connecting together Ethernet networks and L2TPv3 point-to-point tunnels that are on different locations on the transport network.

Each location runs an instance of LISPER and an instance of a LISP controller to which multiple network interfaces can be connected.

Some of the interfaces can connect to physical Ethernet networks, others can connect to IPv6 networks (routed or not). The IPv6 interfaces carry packets to/from L2TPv3 tunnels and to/from remote LISPER instances. The same IPv6 interface can connect to multiple tunnels and/or LISPER instances, so a single interface is sufficient to connect everything at one location. Directly attached Ethernet networks are the exception: these require separate interfaces.

LISPER can work with any Linux eth interface via raw sockets or it can use its built-in Intel10G driver to work with Intel 82599 network cards directly. The Intel10G driver also supports 802.1Q which allows multiple virtual interfaces to be configured on a single network card.

Download

https://github.com/capr/snabbswitch/archive/master.zip

Compile

make

Quick Demo

Tested on Ubuntu 14.04 and NixOS 15.09.

cd src/program/lisper/dev-env

./net-bringup      # create a test network and start everything
./ping-all         # run ping tests
./net-teardown     # kill everything and clean up

NOTE: The test network creates network namespaces r2 and nodeN where N=01..08 so make sure you don’t use these namespaces already.

Run

src/snabb lisper -c <config.file>

Configure

The config file is a JSON file that looks like this:

{
   "control_sock" : "/var/tmp/lisp-ipc-map-cache04",
   "punt_sock"    : "/var/tmp/lispers.net-itr04",
   "arp_timeout"  : 60, // seconds

   "interfaces": [
      { "name": "e0",  "mac": "00:00:00:00:01:04",
                        "pci": "0000:05:00.0", "vlan_id": 2 },
      { "name": "e03", "mac": "00:00:00:00:01:03" },
      { "name": "e13", "mac": "00:00:00:00:01:13" }
   ],

   "exits": [
      { "name": "e0", "ip": "fd80:4::2", "interface": "e0",
         "next_hop": "fd80:4::1" }
   ],

   "lispers": [
      { "ip": "fd80:8::2", "exit": "e0" }
   ],

   "local_networks": [
      { "iid": 1, "type": "L2TPv3", "ip": "fd80:1::2", "exit": "e0",
         "session_id": 1, "cookie": "" },
      { "iid": 1, "type": "L2TPv3", "ip": "fd80:2::2", "exit": "e0",
         "session_id": 2, "cookie": "" },
      { "iid": 1, "interface": "e03" },
      { "iid": 1, "interface": "e13" }
   ]
}

Connectivity with the LISP controller requires control_sock and punt_sock, two named sockets that must be the same sockets that the LISP controller was configured with. These can be skipped if there’s no LISP controller.

interfaces is an array defining the physical interfaces. name and mac are required. If pci is given, the Intel10G driver is used. If vlan_id is given, the interface is assumed to be an 802.1Q trunk.

exits is an array defining the IPv6 exit points (if any) which are used for connecting to remote LISPER instances and to L2TPv3 tunnels. name, ip, interface, next_hop are all required fields.

lispers is an array defining remote LISPER instances, if any. ip and exit are required.

local_networks is an array defining the local L2 networks connected to this LISPER instance. These can be either local networks (in which case only interface is required) or L2TPv3 end-points (in which case type must be “L2TPv3”, and ip, session_id, cookie and exit are required).
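The required fields listed above lend themselves to a simple pre-flight check. The sketch below works on a Lua table that mirrors the JSON config and reports every required field that is missing; missing_fields and its tables are hypothetical and not part of program.lisper, which may validate its config differently.

```lua
-- Hypothetical checker, not part of program.lisper. It works on a Lua
-- table mirroring the JSON config and reports missing required fields.
local REQUIRED = {
   interfaces = { "name", "mac" },
   exits      = { "name", "ip", "interface", "next_hop" },
   lispers    = { "ip", "exit" },
}
local L2TPV3_FIELDS = { "ip", "session_id", "cookie", "exit" }
local ETHERNET_FIELDS = { "interface" }

local function check_entries (missing, section, entries, fields_for)
   for i, entry in ipairs(entries or {}) do
      for _, field in ipairs(fields_for(entry)) do
         if entry[field] == nil then
            table.insert(missing, string.format("%s[%d].%s", section, i, field))
         end
      end
   end
end

local function missing_fields (conf)
   local missing = {}
   for section, fields in pairs(REQUIRED) do
      check_entries(missing, section, conf[section], function () return fields end)
   end
   -- Local networks need different fields depending on their type.
   check_entries(missing, "local_networks", conf.local_networks, function (net)
      return net.type == "L2TPv3" and L2TPV3_FIELDS or ETHERNET_FIELDS
   end)
   table.sort(missing)
   return missing
end

local conf = {
   interfaces = { { name = "e0", mac = "00:00:00:00:01:04" } },
   exits = { { name = "e0", ip = "fd80:4::2", interface = "e0" } },  -- no next_hop
   local_networks = { { iid = 1, type = "L2TPv3", ip = "fd80:1::2", exit = "e0",
                        session_id = 1 } },                          -- no cookie
}
for _, path in ipairs(missing_fields(conf)) do print(path) end
```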

Demo/Test Suite

TL;DR

cd src/program/lisper/dev-env

./net-bringup             # create a test network and start everything
./net-bringup-intel10g    # create a test network using Intel10G cards
./ping-all                # run ping tests
./nsnode N                # get a shell in the network namespace of a node
./nsr2                    # get a shell in the network namespace of R2
./net-teardown            # kill everything and clean up

NOTE: net-bringup-intel10g requires 4 network cards with loopback cables between cards 1,2 and 3,4. Edit the script to set their names and PCI addresses and also edit lisperXX.conf.intel10g config files and change the pci and vlan_id fields as needed. You can find the PCI addresses of the cards in your machine with lspci | grep 82599.

./ping-all sends 2000 IPv4 pings of 1000 bytes each between various nodes. Its output should look like this:

l2tp-l2tp
2000 packets transmitted, 2000 received, 0% packet loss, time 443ms
2000 packets transmitted, 2000 received, 0% packet loss, time 603ms
l2tp-eth
2000 packets transmitted, 2000 received, 0% packet loss, time 358ms
2000 packets transmitted, 2000 received, 0% packet loss, time 502ms
eth-l2tp
2000 packets transmitted, 2000 received, 0% packet loss, time 354ms
2000 packets transmitted, 2000 received, 0% packet loss, time 507ms
l2tp-lisper-l2tp
2000 packets transmitted, 2000 received, 0% packet loss, time 1026ms
2000 packets transmitted, 2000 received, 0% packet loss, time 1037ms
eth-lisper-eth
2000 packets transmitted, 2000 received, 0% packet loss, time 926ms
2000 packets transmitted, 2000 received, 0% packet loss, time 876ms

What does it do

The test network is comprised of multiple network nodes that are all connected to an R2 IPv6 router. The nodes are in different network namespaces and are assigned IPs in different IPv6 subnets to simulate physical locations.

Node namespaces are named nodeXX where XX is 01, 02, 04, 05, 06 and 08. The router lives in the r2 namespace.

Nodes 01, 02, 05, 06 each contain both endpoints of an L2TPv3 tunnel.

Nodes 04, 08 each contain one LISPER instance and one local Ethernet network.

Each node has at least one interface in the L2 overlay network with ip 10.0.0.N/24. You should be able to ping between any of them (see ping-all).

Note the speed differences between nodes. The worst case is if you go to node 01 (which contains 10.0.0.1 which is a L2TPv3 tunnel) and from there ping 10.0.0.5 (which is itself on a L2TPv3 tunnel on a remote LISPER).

Bugs and Limitations

encryption between LISPER nodes is not implemented.
L2 multicast is not implemented.
arp_timeout config option is not followed.
more testing with MAC addresses moving between locations is required.
more performance testing and tuning is required.
only one IPv6 exit-point per interface is supported.

Ptree

Ptree (program.ptree)

Example Snabb program for prototyping multi-process YANG-based network functions.

Overview

The lib.ptree facility in Snabb allows network engineers to build a network function out of a tree of processes described by a YANG schema. The root process runs the management plane, and the leaf processes (the “workers”) run the data plane. The apps and links in the workers are declaratively created as a function of a YANG configuration.

This snabb ptree program is a tool to allow quick prototyping of network functions using the ptree facilities. The invocation syntax of snabb ptree is as follows:

snabb ptree [OPTION...] SCHEMA.YANG SETUP.LUA CONF

The schema.yang file contains a YANG schema describing the network function’s configuration. setup.lua defines a Lua function mapping a configuration to apps and links for a set of worker processes. conf is the initial configuration of the network function.

Example: Simple packet filter

Let’s say we’re going to make a packet filter application. We can use Snabb’s built-in support for filters expressed in pflang, the language used by tcpdump, and just hook that filter up to a full-duplex NIC.

To begin with, we have to think about how to represent the configuration of the network function. If we simply want to be able to specify the PCI device of a NIC, an RSS queue, and a filter string, we could describe it with a YANG schema like this:

module snabb-pf-v1 {
  namespace snabb:pf-v1;
  prefix pf-v1;

  leaf device { type string; mandatory true; }
  leaf rss-queue { type uint8; default 0; }
  leaf filter { type string; default ""; }
}

We throw this into a file pf-v1.yang. In YANG, a module’s body contains configuration declarations, most importantly leaf, container, and list. In our snabb-pf-v1 schema, there is a module containing three leafs: device, rss-queue, and filter. Snabb effectively generates a validating parser for configurations following this YANG schema; a configuration file must contain exactly one device FOO; declaration and may contain one rss-queue statement and one filter statement. Thus a concrete configuration following this YANG schema might look like this:

device 83:00.0;
rss-queue 0;
filter "tcp port 80";

So let’s just drop that into a file pf-v1.cfg and use that as our initial configuration.

Now we just need to map from this configuration to app graphs in some set of workers. The setup.lua file should define this function.

-- Function taking a snabb-pf-v1 configuration and
-- returning a table mapping worker ID to app graph.
return function (conf)
   -- Write me :)
end

The conf parameter to the setup function is a Lua representation of config data for this network function. In our case it will be a table containing the keys device, rss_queue, and filter. (Note that Snabb’s YANG support maps dashes to underscores for the Lua data, so it really is rss_queue and not rss-queue.)

The return value of the setup function is a table whose keys are “worker IDs”, and whose values are the corresponding app graphs. A worker ID can be any Lua value, for example a number or a string or whatever. If the user later reconfigures the network function (perhaps setting a different filter string), the manager will re-run the setup function to produce a new set of worker IDs and app graphs. The manager will then stop workers whose ID is no longer present, start new workers, and reconfigure workers whose ID is still present.

In our case we’re just going to have one worker, so we can use any worker ID. If the user reconfigures the filter but keeps the same device and RSS queue, we don’t want to interrupt packet flow, so we want to use a worker ID that won’t change. But if the user changes the device, probably we do want to restart the worker, so maybe we make the worker ID a function of the device name.
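The way worker IDs drive reconfiguration can be modelled in a few lines. The sketch below is a toy version of the comparison between the old and new results of the setup function; diff_workers is a hypothetical name, the app graphs are replaced by placeholder strings, and the actual lib.ptree manager is not shown here.

```lua
-- Toy model of how worker IDs drive reconfiguration; this is not the
-- lib.ptree code itself, and the graph values are placeholder strings.
local function diff_workers (old, new)
   local actions = { stop = {}, start = {}, reconfigure = {} }
   for id in pairs(old) do
      if new[id] == nil then table.insert(actions.stop, id) end
   end
   for id in pairs(new) do
      if old[id] == nil then
         table.insert(actions.start, id)
      else
         table.insert(actions.reconfigure, id)
      end
   end
   return actions
end

local before = { ["83:00.0/0"] = "graph with filter 'tcp port 80'" }
local after_filter_change = { ["83:00.0/0"] = "graph with filter 'udp'" }
local after_device_change = { ["84:00.0/0"] = "graph with filter 'tcp port 80'" }

-- Same ID: the existing worker is reconfigured in place.
local a = diff_workers(before, after_filter_change)
print(#a.stop, #a.start, #a.reconfigure)

-- Different ID: the old worker is stopped and a new one is started.
local b = diff_workers(before, after_device_change)
print(#b.stop, #b.start, #b.reconfigure)
```

With an ID built from the device and queue, changing only the filter keeps the ID stable, whereas changing the device produces a new ID and therefore a restart.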

With all of these considerations, we are ready to actually write the setup function.

local app_graph = require('core.config')
local pci = require('lib.hardware.pci')
local pcap_filter = require('apps.packet_filter.pcap_filter')

-- Function taking a snabb-pf-v1 configuration and
-- returning a table mapping worker ID to app graph.
return function (conf)
   -- Load NIC driver for PCI address.
   local device_info = pci.device_info(conf.device)
   local driver = require(device_info.driver).driver

   -- Make a new app graph for this configuration.
   local graph = app_graph.new()
   app_graph.app(graph, "nic", driver,
                 {pciaddr=conf.device, rxq=conf.rss_queue,
                  txq=conf.rss_queue})
   app_graph.app(graph, "filter", pcap_filter.PcapFilter,
                 {filter=conf.filter})
   app_graph.link(graph, "nic."..device_info.tx.." -> filter.input")
   app_graph.link(graph, "filter.output -> nic."..device_info.rx)

   -- Use DEVICE/QUEUE as the worker ID.
   local id = conf.device..'/'..conf.rss_queue

   -- One worker with the given ID and the given app graph.
   return {[id]=graph}
end

Put this in, say, pf-v1.lua, and we’re good to go. The network function can be run like this:

$ snabb ptree --name my-filter pf-v1.yang pf-v1.lua pf-v1.cfg

See snabb ptree --help for full details on arguments like --name.

Tuning

The snabb ptree program also takes a number of options that apply to the data-plane processes.

— --cpu cpus

Allocate cpus to the data-plane processes. The manager of the process tree will allocate CPUs from this set to data-plane workers. For example, --cpu 3-5,7-9 assigns CPUs 3, 4, 5, 7, 8, and 9 to the network function. The manager will try to allocate a CPU for a worker that is NUMA-local to the PCI devices used by the worker.
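To make the range syntax concrete, here is a minimal sketch of how a CPU set string could be expanded into a list of CPU numbers. The function parse_cpuset is a hypothetical name for this illustration and is not Snabb's own option parser; NUMA-aware selection is not modeled.

```lua
-- Hypothetical parser for a CPU set string such as "3-5,7-9".
-- Illustration only; this is not Snabb's own option parser.
local function parse_cpuset(str)
   local cpus = {}
   for range in str:gmatch('[^,]+') do
      -- Try "LO-HI" first, then fall back to a single CPU number.
      local lo, hi = range:match('^%s*(%d+)%s*%-%s*(%d+)%s*$')
      if not lo then
         lo = range:match('^%s*(%d+)%s*$')
         hi = lo
      end
      assert(lo, 'invalid CPU range: '..range)
      lo, hi = tonumber(lo), tonumber(hi)
      assert(lo <= hi, 'invalid CPU range: '..range)
      for cpu = lo, hi do table.insert(cpus, cpu) end
   end
   return cpus
end

print(table.concat(parse_cpuset('3-5,7-9'), ' '))
```

For the string from the example above, the sketch yields CPUs 3, 4, 5, 7, 8, and 9.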

— --real-time

Use the SCHED_FIFO real-time scheduler for the data-plane processes.

— --on-ingress-drop action

If a data-plane process detects too many dropped packets (by default, 100K packets over 30 seconds), perform action. Available actions are flush, which tells Snabb to re-optimize the code; warn, which simply prints a warning and raises an alarm; and off, which does nothing.

Reconfiguration

The manager of a ptree-based Snabb network function also listens to configuration queries and updates on a local socket. The user-facing side of this interface is snabb config. A snabb config user can address a local ptree network function by PID, but it’s easier to do so by name, so the above example passed --name my-filter to the snabb ptree invocation.

For example, we can get the configuration of a running network function with snabb config get:

$ snabb config get my-filter /
device 83:00.0;
rss-queue 0;
filter "tcp port 80";

You can also update the configuration. For example, to move this network function over to device 82:00.0, do:

$ snabb config set my-filter /device 82:00.0
$ snabb config get my-filter /
device 82:00.0;
rss-queue 0;
filter "tcp port 80";

The ptree manager takes the necessary actions to update the dataplane to match the specified configuration.

Multi-process

Let’s say your clients are really loving this network function, so much so that they are running an instance on each network card on your server. Whenever the filter string updates, though, they are getting tired of having to run snabb config set against all of the different processes. Well, you can make them even happier by refactoring the network function to be multi-process.

module snabb-pf-v2 {
  namespace snabb:pf-v2;
  prefix pf-v2;

  /* Default filter string.  */
  leaf filter { type string; default ""; }

  list worker {
    key "device rss-queue";
    leaf device { type string; }
    leaf rss-queue { type uint8; }
    /* Optional worker-specific filter string.  */
    leaf filter { type string; }
  }
}

Here we declare a new YANG model that, instead of having one device and RSS queue, has a whole list of them. The key "device rss-queue" declaration says that the combination of device and RSS queue must be unique: logically, you can’t have two different workers on the same device+queue pair. We declare a default filter at the top level, and also allow each worker to override it with its own filter declaration.

A configuration might look like this:

filter "tcp port 80";
worker {
  device 83:00.0;
  rss-queue 0;
}
worker {
  device 83:00.0;
  rss-queue 1;
}
worker {
  device 83:00.1;
  rss-queue 0;
  filter "tcp port 443";
}
worker {
  device 83:00.1;
  rss-queue 1;
  filter "tcp port 443";
}

Finally, we need a new setup function as well:

local app_graph = require('core.config')
local pci = require('lib.hardware.pci')
local pcap_filter = require('apps.packet_filter.pcap_filter')

-- Function taking a snabb-pf-v2 configuration and
-- returning a table mapping worker ID to app graph.
return function (conf)
   local workers = {}
   for k, v in pairs(conf.worker) do
      -- Load NIC driver for PCI address.
      local device_info = pci.device_info(k.device)
      local driver = require(device_info.driver).driver

      -- Make a new app graph for this worker.
      local graph = app_graph.new()
      app_graph.app(graph, "nic", driver,
                    {pciaddr=k.device, rxq=k.rss_queue,
                     txq=k.rss_queue})
      app_graph.app(graph, "filter", pcap_filter.PcapFilter,
                    {filter=v.filter or conf.filter})
      app_graph.link(graph, "nic."..device_info.tx.." -> filter.input")
      app_graph.link(graph, "filter.output -> nic."..device_info.rx)

      -- Use DEVICE/QUEUE as the worker ID.
      local id = k.device..'/'..k.rss_queue

      -- Add worker with the given ID and the given app graph.
      workers[id] = graph
   end
   return workers
end

If we place these into analogously named files, we have a multiprocess network function:

$ snabb ptree --name my-filter pf-v2.yang pf-v2.lua pf-v2.cfg

If you change the root filter string via snabb config, it propagates to all workers, except those that have their own overrides of course:

$ snabb config set my-filter /filter "'tcp port 666'"
$ snabb config get my-filter /filter
"tcp port 666"

The syntax to get at a particular worker is a little gnarly; it’s based on XPath, for compatibility with existing NETCONF NCS systems. See the snabb config documentation for full details.

$ snabb config get my-filter '/worker[device=83:00.1][rss-queue=1]'
filter "tcp port 443";
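How the root filter and the per-worker overrides combine can be sketched with plain Lua tables. This is an illustration under assumptions: the real conf.worker is keyed by a device and RSS queue key, whereas a simple list is used here to keep the sketch self-contained, and effective_filter is a hypothetical helper name.

```lua
-- Plain Lua tables standing in for the parsed snabb-pf-v2 configuration
-- after the root filter has been set to "tcp port 666".
local conf = {
   filter = 'tcp port 666',
   worker = {
      {device='83:00.0', rss_queue=0},
      {device='83:00.0', rss_queue=1},
      {device='83:00.1', rss_queue=0, filter='tcp port 443'},
      {device='83:00.1', rss_queue=1, filter='tcp port 443'},
   }
}

-- Same fallback expression as in the setup function: a worker's own
-- filter wins, otherwise the root filter applies.
local function effective_filter(conf, worker)
   return worker.filter or conf.filter
end

local effective = {}
for _, w in ipairs(conf.worker) do
   local id = w.device..'/'..w.rss_queue
   effective[id] = effective_filter(conf, w)
   print(id, effective[id])
end
```

Note that in Lua only nil and false are falsy, so a worker whose filter is explicitly set to the empty string would still count as an override of the root filter.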

You can stop a worker with snabb config remove:

$ snabb config remove my-filter '/worker[device=83:00.1][rss-queue=1]'
$ snabb config get my-filter /
filter "tcp port 666";
worker {
  device 83:00.0;
  rss-queue 0;
}
worker {
  device 83:00.0;
  rss-queue 1;
}
worker {
  device 83:00.1;
  rss-queue 0;
  filter "tcp port 443";
}

Start up a new one with snabb config add:

$ snabb config add my-filter /worker <<EOF
{
  device 83:00.1;
  rss-queue 1;
  filter "tcp port 8000";
}
EOF

Voilà! Now your clients will think you are a wizard!

Watchdog (lib.watchdog.watchdog)

The lib.watchdog.watchdog module implements a per-thread watchdog. Its purpose is to watch for and kill processes that fail to call the watchdog periodically (for example, because they hang).

It does so by using alarm(3) and ualarm(3) to have the OS send a SIGALRM to the process after a specified timeout. Because the process does not handle the signal, it will be killed and exit with status 142 (128 + 14, where 14 is the signal number of SIGALRM on Linux).

— Function watchdog.set milliseconds

Set watchdog timeout to milliseconds. Values for milliseconds greater than 1,000 are rounded up to the next full second. For example:

watchdog.set(1100) == watchdog.set(2000)
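The equivalence above can be modeled as rounding up to a whole number of seconds once the value exceeds 1,000. The function effective_timeout below is a hypothetical model of that documented behavior, not the module's actual source; the likely reason for the rule is that alarm(3) works in whole seconds, but that is an assumption.

```lua
-- Hypothetical model of the documented rule: timeouts above 1,000 ms
-- take effect at the next full second; smaller values are kept as-is.
local function effective_timeout(milliseconds)
   if milliseconds > 1000 then
      return math.ceil(milliseconds / 1000) * 1000
   end
   return milliseconds
end

print(effective_timeout(1100), effective_timeout(2000), effective_timeout(500))
```

In this model, 1100 and 2000 both map to the same two-second timeout, while 500 is left untouched.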

— Function watchdog.reset

Starts the timeout if the watchdog has not yet been started, and resets the timeout otherwise. If the timeout is reached, the process will be killed.

— Function watchdog.stop

Disables the timeout.
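A typical calling pattern for the three functions might look like the following sketch. Inside Snabb the require call loads the real module; the recording stub is a hypothetical stand-in added only so that the sketch also runs in a plain Lua interpreter, and the 500 ms timeout and three-step loop are arbitrary choices for illustration.

```lua
-- Inside Snabb this loads the real module; elsewhere a recording stub
-- (hypothetical, for illustration only) lets the sketch run on its own.
local ok, watchdog = pcall(require, 'lib.watchdog.watchdog')
local calls = {}
if not ok then
   watchdog = {
      set   = function (ms) table.insert(calls, 'set '..ms) end,
      reset = function () table.insert(calls, 'reset') end,
      stop  = function () table.insert(calls, 'stop') end,
   }
end

watchdog.set(500)       -- allow at most 500 ms between resets
watchdog.reset()        -- first call starts the timer
for step = 1, 3 do
   -- ... one bounded unit of work goes here ...
   watchdog.reset()     -- still alive: push the deadline back
end
watchdog.stop()         -- leaving the supervised section
```

The point of the pattern is that watchdog.reset is called once per unit of work, so a unit that hangs for longer than the timeout gets the process killed.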

Snabblab

Servers devoted to the Snabb project and usable by all known developers.

Want to be a known developer? Sure! Just edit the user account list with your user and send a pull request. No fuss.

Guidelines

- Feel at home. These servers are here for you to play with and enjoy.
- Please run Snabb processes like this: sudo lock ./snabb .... The lock command will automatically wait if somebody else is running a Snabb process on the same machine and that helps us avoid conflicts for access to hardware resources.
- Tell luke@snabb.co your email address(es) to get an invitation to the Lab Slack.
- Don’t keep precious data on the servers. We might want to reinstall them at short notice.

Servers

Name	Purpose	SSH	Intel CPU	NICs
lugano-1	General use	lugano-1.snabb.co	E5 1650v3	2 x 10G (82599), 4 x 10G (X710), 2 x 40G (XL710)
lugano-2	General use	lugano-2.snabb.co	E5 1650v3	2 x 10G (82599), 4 x 10G (X710), 2 x 40G (XL710)
lugano-3	General use	lugano-3.snabb.co	E5 1650v3	2 x 10G (82599), 2 x 100G (ConnectX-4)
lugano-4	General use	lugano-4.snabb.co	E5 1650v3	2 x 10G (82599), 2 x 100G (ConnectX-4)
davos	Continuous Integration tests & driver development	lab1.snabb.co port 2000	2x E5 2603	Diverse 10G/40G: Intel, SolarFlare, Mellanox, Chelsio, Broadcom. Installed upon request.
grindelwald	Snabb NFV testing	lab1.snabb.co port 2010	2x E5 2697v2	12 x 10G (Intel 82599)
interlaken	Haswell/AVX2 testing	lab1.snabb.co port 2030	2x E5 2620v3	12 x 10G (Intel 82599)
murren-*	Hydra fleet for tests without NICs	(none)	i7-6700	(none)

Get started

You are welcome to play, test, and develop on the lugano-1 .. lugano-4 servers. Once your account is added you can connect like this:

$ ssh user@lugano-1.snabb.co

and check the PCI devices and their addresses with lspci.

Certain cards (82599 and ConnectX-4) are cabled to themselves. That is, dual-port cards have their ports connected to each other. Certain other cards (X710/XL710) are currently not cabled. If you have special cabling needs then please open an issue on the snabblab-nixos repository.

Using the lab

All servers run the latest stable version of the NixOS Linux distribution.

To quickly install a package:

$ nox <search string>

For other operations such as uninstalling a package, refer to man nix-env.

Questions

If you have any questions or trouble, ask on the #lab channel or open an issue.

Thanks

We are grateful to Silicom for their sponsorship in the form of discounted network cards for chur and to Netgate for giving us jura. Thanks gang!