While having many successes, deep learning has long been criticzed for being incredibly brittle: small, adversarial changes that are often invisible to the naked eye can drastically change the output of our deep models. In this blog post, we explain how our work in provable defenses can create models that are guaranteed to be safe from these attacks for image classification, summarizing ideas that were published in [Wong & Kolter, 2018].

Part 1: The adversarial attack

To defend against an attack, we first must understand what the attack is. One such adversarial attack, and the one we’ll use in this blog post, is to allow each pixel to change by at most some fixed amount $\epsilon$, also known as the $\ell_\infty$-norm bounded attack. For small $\epsilon$ (e.g. 8/255 for each RGB value), targeted perturbations can have a devastating effect on deep networks built for image classification, while being invisible to the human eye.

To demonstrate, let’s use the following image of a frog from the CIFAR10 dataset. The CIFAR10 dataset can be classified to 95% accuracy by a deep residual network called ResNet18, which is able to correctly identify the image as a frog [He et al., 2015].

plt.imshow(to_image(X[4]))

Basic iterative method (PGD based attack)

A widely-used gradient-based adversarial attack uses a variation of projected gradient descent called the Basic Iterative Method [Kurakin et al. 2016]. Typically referred to as a PGD adversary, this method was later studied in more detail by Madry et al., 2017 and is generally used to find $\ell_\infty$-norm bounded attacks. It involves taking steps in the direction of the sign of the gradient to rapidly increase the loss, while projecting back onto the region of allowable perturbations. The code to do this is relatively straightforward and consists of the following parts:

Compute the gradient of our loss, in this case cross entropy loss.
Take a step of size $\alpha$ in the direction of the sign of the gradient.
Project back onto the region of allowable perturbations, in this case the $\ell_\infty$ ball with radius $8/255$.
Project back onto the region of real images, in this case the range $[0,1]$

# PGD step size
alpha = 2/255
# Radius of allowable perturbations
epsilon = 8/255

X_adv = X.clone().detach()
for i in range(20): 
    # 1. Calculate gradient
    X_adv.requires_grad = True    
    loss = nn.CrossEntropyLoss()(model(X_adv), y)
    loss.backward() 
    
    # 2. Step in direction of the sign of the gradient
    X_adv.data = X_adv + alpha*torch.sign(X_adv.grad)
    
    # 3. Project onto the epsilon ball around X
    X_adv.data = torch.min(X_adv, X+epsilon)
    X_adv.data = torch.max(X_adv, X-epsilon)
    
    # 4. Restrict to space of real images
    X_adv.data = torch.clamp(X_adv, min=0, max=1)
    if i % 4 == 0: 
        print(f'[Iteration {i}] Loss: {loss.item()}')
    
print(f'Normal accuracy: {accuracy(X):.2f}, Adversarial accuracy: {accuracy(X_adv):.2f}')

[Iteration 0] Loss: 0.22667884826660156
[Iteration 4] Loss: 6.564219951629639
[Iteration 8] Loss: 9.29906940460205
[Iteration 12] Loss: 10.847380638122559
[Iteration 16] Loss: 11.443120002746582
Normal accuracy: 0.94, Adversarial accuracy: 0.00

In this case, we see that after 20 iterations, we are able to find a perturbation which drastically increases the loss and causes the classifier to be wrong on all 50 examples in this minibatch! But how visible are these perturbations anyway? Let’s examine the same image of the frog after being adversarially perturbed.

plt.imshow(to_image(X_adv[4]))

This frog looks almost indistinguishable to the original frog. The perturbation radius of $\epsilon=8/255$ is way too small to be seen by the human eye, but sufficiently large enough to completely fool our image classifier. This kind of adversarial attack is enough to bring our image classifier from 95% accuracy to 0% under adversarial perturbations on the entire CIFAR10 dataset.

Mathematically, these kinds of attacks can be described as lying within $\mathcal B_\epsilon(x)$, the $\ell_\infty$ ball around an image $x$ with radius $\epsilon$. This is what is often referred to as the threat model for the adversary, which formally describes what the set of perturbed examples that an adversary can use to attack (and what we will be defending against).

$\mathcal B_\epsilon(x) = \{ x + \delta : \|\delta\|_\infty \leq \epsilon\}$

While here we demonstrated one such attack (the basic iterative method), there are many other ways to generate these adversarial examples. A number of these attacks have been compiled into various libraries, including Cleverhans and Foolbox. Some of these attacks are better suited for specific perturbation regions (e.g. there are attacks for $\ell_0$, $\ell_1$, and $\ell_2$-norm bounded perturbations instead of $\ell_\infty$), and sometimes the success of an attack can change drastically under various hyperparameter settings. Most of these attacks are heuristic or approximate, and may not actually find the worst possible adversarial attack within the region.

This matters greatly when evaluating the robustness of a heuristic defense against adversarial examples: using a weak attack with suboptimal parameters can create the false impression that a defense is successful at increasing model robustness, when in reality, the attack used to evaluate the defense was just not strong enough to find an actual adversarial example. While some heuristic defenses have so far held up to their claims [Madry et al. 2017], the vast majority fall short when attacked by a stronger adversary. To get an idea of how often this occurs, [Carlini & Wagner, 2016], [Carlini & Wagner, 2017], [Athalye et al., 2017], [Uesato et al., 2018], [Athalye et al., 2018], [Engstrom et al., 2018], and [Carlini, 2019], form just a subset of papers demonstrating how adversaries can be made even stronger to get around defenses, rendering most heuristic defenses unreliable. As a result, a group of researchers have written a set a guidelines towards proper evaluation of adversarial defenses, which exists as a living document on GitHub [Carlini et al., 2019].

In this blog post, our goal is to take a different approach: rather than creating another heuristic defense and evaluating its robustness with adversarial attacks, instead we try to learn models which we can guarantee to be accurate at not just the input example $x$, but within the entire perturbation region $\mathcal B_\epsilon(x)$. If we can certify that every example within $\mathcal B_\epsilon(x)$ is classified correctly, then we no longer need to test with a heuristic attack: our models are provably safe!

Part 2: Proving safety of a neural network

At a first glance, this seems like a daunting task: there is an infinite number of points within the region, and neural networks are notoriously complicated to analyze. One way to approach this problem is to frame the adversary as a constrained optimization problem:

For a ReLU network with 1 hidden layer (which we will use as our running example, though the techniques do generalize to multiple layers), this can be formalized as the following problem, which we’ll refer to as the adversarial problem:

$\begin{split} \min_{z} \;\;&z_3^T(y - y^\text{target})\\ \textrm{subject to}\;\; z_3 &= W_2 z_2 + b_2\\ z_2 &= \max(\hat z_2, 0)\\ \hat z_2 &= W_1z_1 + b_1\\ \epsilon &\geq \|z_1 - x\|_\infty \end{split}$

where, starting from the bottom up, we have

$\|z_1 - x\|_\infty \leq \epsilon$ constrains the input to the neural network, $z_1$, to be within $B_\epsilon(x)$, or the $\ell_\infty$ ball around $x$ with radius $\epsilon$. This constraint encodes the attack model for the adversary.
The other constraints encode the network architecture, so $\hat z_2 = W_1z_1 + b_1$ encodes the output of the first linear layer, $z_2 = \max(\hat z_2, 0)$ encodes the ReLU activation, and $z_3 = W_2\hat z_2 + b_2$ encodes the output of the second linear layer (which is the output of the network).
The objective, $z_3^T(y - y^\text{target})$, is to minimize the network output of the true class $y$ while maximizing the output of the target class $y^\text{target}$. If the adversary can make the objective negative, then the adversary has succeeded at making the network output $y^\text{target}$ instead of $y$.

This problem represents a worst-case adversarial attack, where an attacker tries to flip the class label from the true class to a target class within a bounded region in the input space. As a result, the solution to this problem is a guarantee against adversarial attacks: no adversary, no matter how strong, can find an adversarial example worse than the minimizer to this optimization problem. Importantly, if we know that the solution to this problem is positive, then we know that the output of the network for the true class $y$ is always greater than the output of the network for the target class $y^\text{target}$ within the perturbation region: this serves as a certificate that guarantees the network is safe against adversarial attacks on the input $x$!

However, solving the adversarial problem is hard. While the objective and almost all of the constraints are all linear constraints, unfortunately the constraint for the ReLU activation, $z_2 = \max(\hat z_2,0)$, is non-convex. Because of this, gradient-based methods like the attack discussed earlier (the Basic Iterative Method) are not guaranteed to find the global minimum. It is still possible to solve this problem exactly, for example by using

SMT solvers [Huang et al., 2016], [Katz et al., 2017], [Ehlers, 2017]
Integer programming [Lomuscio & Maganti, 2017], [Cheng et al., 2017], [Tjeng et al., 2017]

These types of solvers are typically combinatorial in runtime, and generally don’t scale to medium-sized or larger networks, especially the convolutional networks traditionally used for image classification. The most scalable so far is one of the mixed-integer programming solvers: utilizing some of the ideas presented in this blog post, the authors of [Tjeng et al., 2017] were able to improve their solver and verify medium sized networks. Their solver is written written in Julia and available on GitHub.

Rather than solving for the worst-case adversarial example, the approach we’ll take in this blog post is to instead compute a tractable bound on the optimal solution. The idea is that a bound on the worst-case adversary can be cheaper (not combinatorial in complexity) to compute than the exact worst case adversary, while still providing a guarantee on the adversarial output. Specifically, we will compute a lower bound on the solution to the adversarial problem, which is a (looser) guarantee on the worst-case adversarial output.

The following diagram visually demonstrates the approach. The adversarial problem effectively takes the region an adversary can attack (the gray region on the left) and passes it through the deep network, resulting in a possibly complicated region in the output space (the gray region on the right). This set of network outputs that an adversary can produce is called the adversarial polytope, and the goal of the adversary is to minimize over this region. Our approach can then be interpreted as bounding the adversarial polytope with more tractable overapproximations of the adversarial polytope, depicted by the red and blue regions.

Bounding the adversarial problem

To make this problem more tractable, we will relax the ReLU constraint into a convex one. First, recall that the ReLU constraint requires that the pre and post activations, $(\hat z, z)$, must lie within the set $\{(\hat z, z) : z = \max(0,\hat z))\}$, as shown in the following graph.

To turn this into a convex region, suppose we knew that the input to the ReLU, $\hat z$, had to lie within an interval $[\ell, u]$.

Then, there are three possibilities depending on the bounds $\ell$ and $u$:

If $\ell \geq 0$, then we know that the ReLU activation is always within the linear region, and we can replace the ReLU constraint with $\hat z = z$.

If $u \leq 0$, then we know that the ReLU activation is always in the zero region, and we can replace the ReLU constraint with $\hat z = 0$.

If $\ell < 0 < u$, then will take the convex envelope of the ReLU between $\ell$ and $u$. Specifically, this is the triangular region formed by the points $(\ell,0)$, $(u,u)$, and $(0,0)$. We can express this region as a set of three inequalities: the region below the line connecting $(\ell,0)$ and $(u,u)$, above the line $z = 0$, and above the line $z = \hat z$. Then, we can replace the ReLU activation with the convex set $\mathcal C(\ell,u)$ defined by these inequalities: $\mathcal C(\ell,u) = \left\{(\hat z, z) : -u\ell \geq z(u-\ell) - u\hat z, \; z \geq 0, \; z \geq \hat z\right\}$

For notational convenience, note that if we let $\tilde u = \max(u,0)$ and $\tilde \ell = \min(\ell,0)$, then all three cases can be compactly written for all the activations as $\mathcal C(\tilde \ell, \tilde u)$.

Going back to our network example with one hidden layer, let $(\ell_2,u_2)$ be the vectors of lower and upper bounds for the pre-activation $\hat z_2$. Note that since $(\ell_2,u_2)$ are vectors, the convex relaxation $\mathcal C(\tilde \ell_2, \tilde u_2)$ is now applied element-wise to each activation. Replacing the ReLU constraint with the constraint $(\hat z_2, z_2) \in \mathcal C(\tilde \ell, \tilde u)$ results in the following relaxed adversarial problem:

$\begin{split} \min_{z} \;\;&z_3^T(y - y^\text{target})\\ \textrm{subject to}\;\; & z_3 = W_2 z_2 + b_2\\ & (\hat z_2, z_2) \in \mathcal C(\tilde \ell,\tilde u)\\ & \hat z_2 = W_1z_1 + b_1\\ & \|z_1 - x\|_\infty \leq \epsilon \end{split}$

which is now a convex problem, since the constraints are now all linear. In fact, since the objective is also linear, it is actually a linear program and can in theory be solved using off-the-shelf solvers (e.g. Gurobi). Since we relaxed the ReLU constraint to a larger set, the solution to this linear program is a lower bound on the original adversarial problem.

Computing the lower and upper bounds $(\ell,u)$

Up until now, we’ve just assumed that these lower and upper bounds $\ell,u$ existed, but how do we even get these to begin with? Since the first ReLU activation follows a linear layer, we can actually do a straightforward lower and upper bound calculation by propagating the input perturbation through the first linear layer.

More specifically, from the attack model, we know that a perturbed input to the neural network must be $z_1 = x + \delta$ where $\|\delta\|_\infty \leq \epsilon$. In combination with the first linear layer, we know that

$\hat z_2 = W_1z_1 + b_1 = W_1(x + \delta) + b_1 = W_1x + b_1 + W_1\delta$

This can be easily maximized and minimized over $\|\delta\|_\infty \leq \epsilon$, giving rise to the lower and upper bounds for the first set of ReLU activations. These bounds have the form $(\ell,u) = W_1x_1 + b_1 \pm \epsilon \|W_1\|_1$, where $\|W_1\|_1$ is the vector of $\ell_1$ norms for each row of $W_1$. While this is sufficient for the first layer, at the end of this post we will discuss how one can efficiently compute bounds for later layers.

Visualization with a 2D example

Before we continue, we’ll first generate a 2-dimensional synthetic binary classification problem. The dataset will consist of 12 data points, randomly sampled to be least $2r$ apart, and randomly assigned to one of two classes. The goal is to learn a classifier that separates the two classes while being safe from adversarial attacks with $\ell_\infty$ radius at most $\epsilon=r/2$.

# random points at least 2r apart
m = 12
np.random.seed(3)
x = [np.random.uniform(size=(2))]
r = 0.16
while(len(x) < m):
    p = np.random.uniform(size=(2))
    if min(np.abs(p-a).sum() for a in x) > 2*r:
        x.append(p)
epsilon = r/2

x = np.array(x)
y = np.random.randint(0,2,m)

We can train a typical neural network with one hidden layer of 100 units to classify these twelve points into their two classes. Since this is a simple problem, we are able to learn a classifier which can perfectly separate the two classes and reach 0% error, using cross-entropy loss and the Adam optimizer.

import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(4)

net = nn.Sequential(
    nn.Linear(2,100),
    nn.ReLU(),
    nn.Linear(100,2)
)

opt = optim.Adam(net.parameters(), lr=1e-3)
for i in range(1000):
    out = net(torch.from_numpy(x).float())
    l = nn.CrossEntropyLoss()(out, torch.from_numpy(y))
    err = (out.max(1)[1].data != torch.from_numpy(y)).float().mean()
    if i % 100 == 0:
        print(f"Loss: {l.item():.2f}, Err: {err:.2f}")
    opt.zero_grad()
    (l).backward()
    opt.step()

Loss: 0.70, Err: 0.58
Loss: 0.59, Err: 0.42
Loss: 0.55, Err: 0.33
Loss: 0.49, Err: 0.33
Loss: 0.42, Err: 0.17
Loss: 0.36, Err: 0.08
Loss: 0.31, Err: 0.08
Loss: 0.26, Err: 0.00
Loss: 0.22, Err: 0.00
Loss: 0.18, Err: 0.00

Since this dataset is 2-dimensional, we’ll be able to plot visualizations of the datapoints, the regions which the adversary can attack, and the learned decision boundary. The two classes of datapoints are plotted in blue and red, and the region of allowable perturbations, the $\ell_\infty$ balls around each datapoint, are the squares around each point. The blue and red regions in the background depict the decision boundaries learned by the classifier.

XX, YY = np.meshgrid(np.linspace(-0.1, 1.1, 100), np.linspace(0, 1, 100))
X0 = torch.Tensor(np.stack([np.ravel(XX), np.ravel(YY)]).T)
y0 = net(X0)
ZZ = (y0[:,0] - y0[:,1]).view(100,100).data.numpy()

fig, ax = plt.subplots(figsize=(8,8))
ax.contourf(XX,YY,-ZZ, cmap="coolwarm", levels=np.linspace(-1000,1000,3))
ax.scatter(x[:,0], x[:,1], c=y, cmap="coolwarm", s=70)
ax.axis("equal")
ax.axis([0,1,0,1])

for a in x:
    ax.add_patch(patches.Rectangle((a[0]-r/2, a[1]-r/2), r, r, fill=False))

2D example: verification using CVXPY

From the visualization, we can already see that the decision boundary intersects the adversarial attack regions for some of the datapoints. Nonetheless, we can still attempt to prove safety for as many points as possible. Since the relaxed adversarial problem is a linear program, we can solve it using CVXPY.

First, we set up the CVXPY variables for our optimization problem, and extract the parameters of our neural network.

import cvxpy as cp
z1, z2, z2hat, z3 = cp.Variable(2), cp.Variable(100), cp.Variable(100), cp.Variable(2)

W1, b1 = net[0].weight.detach().numpy(), net[0].bias.detach().numpy()
W2, b2 = net[2].weight.detach().numpy(), net[2].bias.detach().numpy()

Next, given an input example $x$, we compute the lower and upper bounds for the ReLU activations. Recall that the lower and upper bounds for the first ReLU activation could be computed with $W_1x_1 + b_1 \pm \epsilon \|W_1\|_1$.

def bounds(x): 
    x2 = W1.dot(x) + b1
    l,u = x2 - epsilon*np.abs(W1).sum(1), x2 + epsilon*np.abs(W1).sum(1)
    return np.minimum(l,0), np.maximum(u,0)

Given the lower and upper bounds $(\ell,u)$, we can define the (relaxed) network constraints of the linear program. Remember that since we are using the linear programming relaxation of the adversarial problem, the original ReLU constraint is replaced by the convex relaxation consisting of the three linear inequalities.

def network_constraints(l,u):
    return [
        # linear layer
        z3 == W2*z2hat + b2, 
        # relaxed relu constraints
        z2hat >= 0, 
        z2hat >= z2, 
        cp.multiply(z2hat,(u-l)) - cp.multiply(u,z2) <= -u*l, 
        # linear layer
        z2 == W1*z1 + b1,
    ]

Next, given an input example $x$, we define the constraint on $z_1$, the input to the neural network (the threat model). In this case, this is the $\ell_\infty$ ball around the input example, which corresponds to two more inequality constraints.

def input_constraints(x): 
    return [z1 <= x+epsilon, z1 >= x-epsilon]

Finally, given the true class label $y$, we can compute the adversarial objective, which is to minimize the difference between the true class output and the target class. Since this is a binary classification problem, the target class is the remaining other class.

def objective(y): 
    c = (np.eye(2)[y]*2-1)
    return cp.Minimize(z3*c)

Combining these all together, we can now solve the linear program with CVXPY. For each example in the dataset, we first calculate the activation lower and upper bounds, create the constraints of the linear program, and create the objective. Then we let CVXPY solve the problem, and check the final objective. If the objective is positive, then we are guaranteed that the example is certifiably safe.

certified = []
for x0,y0 in zip(x,y): 
    # calculate the bounds
    l,u = bounds(x0)
    
    # create the constraints
    constraints = network_constraints(l,u) + input_constraints(x0)
    
    # create the objective
    obj = objective(y0)

    prob = cp.Problem(obj, constraints)
    prob.solve()
    
    certified.append(prob.value >= 0)
    
print(f"Total certified: {sum(certified)}/{len(certified)} examples")

Total certified: 8/12 examples

The linear program was able to certify 8 out of the 12 examples! In other words, we have proven that for the trained neural network, 8 of the points are immune to any adversarial attack within the perturbation region. Let’s take a closer look at the examples that we could not certify, which are outlined with thick red borders in the visualization below.

fig, ax = plt.subplots(figsize=(8,8))
ax.contourf(XX,YY,-ZZ, cmap="coolwarm", levels=np.linspace(-1000,1000,3))
ax.scatter(x[:,0], x[:,1], c=y, cmap="coolwarm", s=70)
ax.axis("equal")
ax.axis([0,1,0,1])

for a,cert in zip(x,certified):
    if cert: 
        ax.add_patch(patches.Rectangle((a[0]-r/2, a[1]-r/2), r, r, fill=False))
    else: 
        # mark uncertified examples with a thick red border
        ax.add_patch(patches.Rectangle((a[0]-r/2, a[1]-r/2), r, r, fill=False, edgecolor='r', linewidth=5))

We can see that for three of the four examples, the perturbation regions cross the decision boundary, and so of course they are not certifiable. However, you may have noticed an anomaly with the fourth example on the right: we can visually confirm that the example is “safe” but the linear program was unable to certify it. This is because the optimization problem is a lower bound on the original problem: in order to relax the ReLU activation constraints to the convex, triangular region, we must pay the penalty of looseness in the bound. So for this example, even though the exact adversarial problem can be certified, the linear program is too loose and outputs a negative certificate, which is not a guarantee. As it turns out, directly using this relaxation on typical networks trained on real datasets is far too loose to be useful.

Part 3: Training a provably robust network

While the previous section showed a way to verify networks, it has two major shortcomings on real world problems. First, it is simply too loose to provide meaningful guarantees on standard networks. It is also expensive: even though it is not combinatorially expensive in runtime, solving a linear program is still not scalable to large networks, and is not very fast for use during training. However, in this section we will show that it is possible to compute a computationally tractable bound to the linear program. While this bound is inherently even more loose, we can counteract this with training by learning networks which minimize the bound, and get certified neural networks.

Bounding the linear program with its dual

We first start by recognizing that since our optimization problem is a linear program, there exists a dual linear program. The dual linear program is a standard concept from optimization arising from duality theory, and has two key characteristics:

The dual of a linear program is another linear program, which has a standard, explicit form that can be found in most convex optimization courses.
Any feasible point of the dual linear program is a lower bound on the optimal value of the original (minimization) linear program.

The first property allows us derive the dual of our linear program (the relaxed adversarial problem), while the second property allows us to efficiently bound our linear programming bound on the adversarial problem. Understanding the details for how to derive the dual problem is not important to this blog post but can be found in the appendix of our paper [Wong & Kolter, 2018]. The resulting dual problem is the following linear program, which is an optimization problem over the dual variables $\nu$ and $\alpha$:

$\begin{split} \max_{\nu,\alpha\geq 0} \;\;& -\nu_1^Tx - \epsilon \|\nu_1\|_1 - \nu_2^Tb_1 + \sum_{j \in \mathcal I}\ell_{j}[\nu_2]_+ - \nu_3^Tb_2\\ \textrm{subject to}\;\; & \nu_3 = -(y - y^\text{target})\\ & \hat \nu_2 = W_2^T\nu_3\\ &\nu_{2,i} = 0 &\quad \text{if}\; u_i \leq 0\\ &\nu_{2,i} = \hat \nu_{2,i} &\quad \text{if}\; \ell_i \geq 0\\ &\nu_{2,i} = \frac{u_i}{u_i-\ell_i}\odot[\hat \nu_{2,i}]_+ - \alpha_i\odot [-\hat \nu_{2,i}]_+ &\quad \text{if}\; \ell_i < 0 < u_i\\ &\nu_1 = W_1^T\nu_2 \end{split}$

where $[\nu]_+ = \max(0,\nu)$, $\|\nu\|_1 = \sum_i |\nu_i|$ denotes the $\ell_1$ norm, and $\odot$ denotes elementwise multiplication.

There are two notable observation here:

Only $\alpha$ is a free variable. Once $\alpha$ is set, all $\nu_i$ have a fixed value. So one way to get a cheap bound on the original linear program is to simply set $\alpha$ to be some fixed value and evaluate the objective term, since any dual feasible point is a bound on the original problem. In practice, we can take $\alpha = \frac{u}{u-\ell}$, which simplifies the dual constraint for $\nu_2$ to (in vector form)

$\nu_2 = \frac{\tilde u}{\tilde u - \tilde \ell}\odot \hat \nu_2$

The constraints of the dual problem can be viewed as a forward pass through another neural network which we call the dual network. Starting with $\nu_3 = -(y - y^\text{target})$ as the “input” to the dual network, we then pass it through various layers until we get to $\nu_1$. By keeping all intermediate values $\nu_i$, we can then compute the objective with a single pass through the dual network.

The dual network here (the propagation of variables from $\nu_3$ to $\nu_1$) may look oddly familiar: it is almost identical to the backwards pass of the original network during backpropagation, with the only difference being the additional free variables $\alpha$.

We can let $g$ be the dual network encoded by these constraints, so $(\nu_1,\nu_2,\nu_3) = g(y - y^\text{target};\alpha)$, and let $J(\nu)$ be the objective of the dual problem. Then in the end, for a fixed alpha, the dual objective can be written as

$J(g(y-y^\text{target}; \alpha))$

This can be interpreted as a “loss function” $J$ applied to a “neural network” $g$ evaluated on an “input” $y-y^\text{target}$. This is exactly the setting that standard deep learning toolkits excel at, so we can evaluate and train to minimize the bound using standard deep learning toolkits!

To use this in training, we can calculate $J(g(y-y^\text{target}))$ for all possible $y^\text{target}$, and use it as a drop in replacement for the output of our network, typically into the cross entropy loss function.

To use this for certification, since we are computing a dual feasible solution, we are guaranteed that our resulting dual objective is a lower bound on the original linear program, and consequently is a lower bound on the original adversarial problem. As a result, when $J(g(y - y^\text{target})) > 0$ for all possible targets $y^\text{target} \neq y$, we are guaranteed that there does not exist an attack which can change the label from $y$, since the bound serves as a proof that the example is safe from any adversarial attack.

2D example: Training the dual bound with PyTorch

We can now calculate the dual objective for the 2D example, and backpropagate to learn a provably robust network using PyTorch. We will first start with the same architecture considered earlier, converting our numpy dataset over to PyTorch tensors.

torch.manual_seed(4)

net = nn.Sequential(
    nn.Linear(2,100),
    nn.ReLU(),
    nn.Linear(100,2)
)

opt = optim.Adam(net.parameters(), lr=1e-3)
X_pt, y_pt = torch.from_numpy(x).float(), torch.from_numpy(y)

W1, b1 = net[0].weight, net[0].bias
W2, b2 = net[2].weight, net[2].bias

We’ll compute the upper and lower bounds for the ReLU activation using the same method as before, but now for PyTorch tensors.

def bounds_pt(X, net): 
    X2 = F.linear(X, W1, bias=b1)
    l,u = X2 - epsilon*W1.abs().sum(1), X2 + epsilon*W1.abs().sum(1)
    return torch.clamp(l,max=0), torch.clamp(u,min=0)

Next, we’ll construct the dual network as a standard PyTorch module, which takes an input to the dual network $c$ along with the lower and upper bounds $(\ell,u)$. It’s implemented using the transpose linear layers and the dual feasible solution for $\alpha$ as described earlier.

class DualNet(nn.Module): 
    def forward(self, c, l, u): 
        nu3 = -c
        nu2hat = F.linear(nu3, W2.t())
        nu2 = u/(u-l)*nu2hat
        nu1 = F.linear(nu2, W1.t())
        return (nu1,nu2,nu3)

dualnet = DualNet()

Given the dual network outputs $(\nu_1, \nu_2, \nu_3)$, and the lower bound $\ell$, we can create a function that computes the dual objective. To keep the code from being too verbose, we also use a helper function that performs the dot product of two vectors in batch mode called bdot(X,Y).

def bdot(X,Y): 
    return torch.matmul(X.unsqueeze(-2),Y.unsqueeze(-1)).squeeze(-1).squeeze(-1)

def J(nu1, nu2, nu3, l): 
    return (- bdot(nu1, X_pt) - epsilon*nu1.abs().sum(-1)- bdot(nu2, b1) 
            + (l*torch.clamp(nu2,min=0)*I.float()).sum(-1) - bdot(nu3,b2))

Finally, we can put all these parts together and train against our bound on the worst case adversarial attack. Specifically, for each epoch, we first compute the lower and upper bounds of the ReLU activations. Then, we pass the corresponding vectors for each possible attack target through the dual network with the calculated lower and upper bounds. Finally, we compute the dual objective, drop it into the cross entropy loss function, and backpropagate. Note that the dual objective when $y = y^\text{target}$ is zero, and that we drop in the negative objective since we want to increase the certificate to become positive.

import torch.nn.functional as F
for i in range(5000):
    
    # Compute upper and lower bounds
    l,u = bounds_pt(X_pt, net)
    I = (u > 0)*(l < 0)
    
    # Pass through dual network
    c = torch.eye(2)[y_pt]*2-1
    nu1,nu2,n3 = dualnet(c,l,u)
    
    # Compute objective and loss
    dual_obj = J(nu1,nu2,nu3,l)
    loss = nn.CrossEntropyLoss()(-dual_obj.unsqueeze(-1)*(torch.eye(2)[~y]), y_pt)
    
    # Compute error and robust error
    out = net(torch.from_numpy(x).float())
    err = (out.max(1)[1].data != y_pt).float().mean()
    robust_err = (dual_obj < 0).float().mean()
    
    if i % 500 == 0:
        print(f"loss: {(loss).item():.2f}, err: {err.item():.2f}, robust err: {robust_err.item():.2f}")
    opt.zero_grad()
    (loss).backward()
    opt.step()

loss: 0.71, err: 0.58, robust err: 0.58
loss: 0.61, err: 0.17, robust err: 0.42
loss: 0.54, err: 0.17, robust err: 0.25
loss: 0.48, err: 0.17, robust err: 0.33
loss: 0.42, err: 0.00, robust err: 0.17
loss: 0.37, err: 0.00, robust err: 0.17
loss: 0.33, err: 0.00, robust err: 0.08
loss: 0.29, err: 0.00, robust err: 0.00
loss: 0.26, err: 0.00, robust err: 0.00
loss: 0.23, err: 0.00, robust err: 0.00

We can also compute some statistics, namely the loss of the dual objective, the standard error rate, and the robust error rate. The robust error rate is the the fraction of examples which are not certified (so the dual objective is negative), which provides an upper bound on the worst case adversarial attack. In other words, the robust error is a guarantee that our model cannot be attacked to worse than the robust error rate. Since the network eventually achieves a 0% robust error rate, every example in the dataset is guaranteed to be safe against adversarial attacks.

Finally, let’s once again visualize the classifier that we have learned. In contrast to the standard network, we now see that all the squares representing the perturbation regions are now fully contained in either the red or the blue classes. This is to be expected, since the network was able to certify every point in the dataset.

y0 = net(X0)
ZZ = (y0[:,0] - y0[:,1]).view(100,100).data.numpy()

fig, ax = plt.subplots(figsize=(8,8))
ax.contourf(XX,YY,-ZZ, cmap="coolwarm", levels=np.linspace(-1000,1000,3))
ax.scatter(x[:,0], x[:,1], c=y, cmap="coolwarm", s=70)
ax.axis("equal")
ax.axis([0,1,0,1])

for a in x:
    ax.add_patch(patches.Rectangle((a[0]-r/2, a[1]-r/2), r, r, fill=False))

Computing upper and lower bounds via the dual network

Finally, within this framework there is a straightforward, inductive way to compute upper and lower bounds for the ReLU activations: we can write the problem of finding lower and upper bounds for each ReLU activation in almost the same way as the adversarial problem with two differences. First, we only use a subset of the network up to the activation in question, and second, we use a different adversarial objective. For example, we can write the problem of finding a lower bound to an activation in the $k$th layer as

$\begin{split} \min_{z} \;\;&z_k^Te_i\\ \textrm{subject to}\;\; z_{i+1} &= W_i z_i + b_i\;\; \text{for}\;\;i=1...k-1\\ z_i &= \max(\hat z_i, 0) \;\; \text{for}\;\; i=2\dots k-1\\ \epsilon &\geq \|z_1 - x\|_\infty \end{split}$

Note that this is structurally identical to the problem of finding an adversarial example on a $k$ layer network, but with the objective $z_k^Te_i$ instead of $z_k^T(y - y^\text{target})$. In other words, the “adversary” is trying to minimize the output of the $i$th activation at the $k$th layer. As a result, the entire framework can apply towards calculating lower and upper bounds, by passing in $\pm e_i$ through a subset of the dual network and calculating the same dual objective, which gives lower and upper bounds for the $i$th activation in the $k$th layer. In fact, by caching intermediate results, it is possible to compute upper and lower bounds for all activations with a single pass through the dual network. For more details, see Algorithm 1 of the main paper.

Summary

In this post, we showed how to derive provable bounds on all possible adversarial attacks within the $\ell_\infty$ threat model. We started by formulating the adversary as a non-convex optimization problem. We then bounded the adversary with an easier linear program, and bounded it again with a dual feasible solution. Finally, by training against the dual objective, the lower bound becomes large enough to actually guarantee safety.

Provable defenses against adversarial attacks

Using ideas from linear programming and duality to create unbreakable, certified adversarial defenses