It’s become well-known in recent years that computer vision models are not robust to adversarial examples. By altering input data in ways that are often imperceptible to humans, one can completely fool image classifiers into making wrong decisions with confidence. Moreover, it’s been shown that even black-box models are vulnerable to adversarial examples crafted on completely different models. In this report we examine the effects of black-box adversarial machine learning on two image datasets, GTSRB and CIFAR-10. We demonstrate that in both cases the accuracy of black-box classification models can be substantially degraded by adversarial examples generated from substitute models.
While all software has security vulnerabilities, software incorporating machine learning must contend with an additional set of vulnerabilities of its own. In particular, computer vision models are prone to manipulation through the creation of adversarial examples: inputs designed specifically to fool the model into making incorrect decisions with high confidence.
An attacker can use adversarial examples to, for example, trick the object detection system of an autonomous vehicle into classifying a stop sign as a yield sign, which could easily cause an accident. Online-learning models face even greater dangers: an attacker can use adversarial examples to poison such a model, feeding it bad training data that alters its behavior and renders it unusable.
Adversarial examples can be crafted by altering clean input data in ways that are often imperceptible to humans. In one famous example, an image classifier trained on ImageNet is fooled as follows: as shown in Figure 1, one takes an image of a panda and adds an imperceptible amount of noise to it, causing the model to classify it as a gibbon with high confidence.
It has also been shown that adversarial examples are transferable. That is, adversarial examples created to fool one computer vision model can often be used to fool other models as well. This allows adversarial computer vision to be performed in a black-box setting. In a black-box setting, one supposes that an attacker wishes to subvert the intent of a computer vision model about which he has little to no knowledge. He doesn’t know which model was used or on what data it was trained. He only has a rough idea of what the input and output data look like.
For example, in the above ImageNet example, the attacker may have an idea that the black-box model classifies ImageNet-like images, e.g. through querying the model multiple times to understand its signature, but little idea what model was used or exactly which images were used in training. To get around this, he can instead train his own model on a set of images he thinks is close to what the black-box was trained on and use that model to create adversarial images to attack the black-box.
Machine learning is the construction of algorithms that learn from data and can make predictions about data without the need for human input. These algorithms are generally statistical, in the sense that they use a sample of data to uncover information about the underlying distribution. Machine learning models that capture the underlying distribution well are said to generalize. The most well-developed and commonly used sub-field of machine learning is supervised learning, which is where most work in adversarial machine learning applies.
In supervised learning, the goal is to use a set of inputs to predict a set of outputs. More formally, suppose a dataset $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^n$ contains pairs of feature vectors $x_i$ and targets $y_i$ sampled from a joint probability distribution $p(x, y)$. Suppose this distribution can be expressed by a function $f$ plus noise $\epsilon$, so $y = f(x) + \epsilon$. The goal of supervised learning, then, is to use a learning algorithm to find a function $\hat{f}$ from some model class $\mathcal{H}$ such that $\hat{f}(x) \approx f(x)$ for all $x$. An illustration of the supervised learning process is shown in Figure 2.
This process can be easily understood with the simple example of logistic regression. Logistic regression is a simple type of classifier, i.e. a supervised learning algorithm in which the target space is discrete; the outputs are usually called labels. For logistic regression, each $x \in \mathbb{R}^d$ and $y \in \{0, 1\}$, and $f$ is assumed to be a binary-valued function on $\mathbb{R}^d$. The model class is composed of parametric functions of the form

$$\hat{f}(x) = \sigma(w^\top x + b) = \frac{1}{1 + e^{-(w^\top x + b)}},$$

where $w \in \mathbb{R}^d$ and $b \in \mathbb{R}$ are parameters to be estimated by the learning algorithm. The learning algorithm is typically a simple optimization procedure that seeks to minimize some loss function $L$ defined on the data with respect to the model parameters. A simple optimization algorithm for doing so is gradient descent, which updates the parameters $\theta = (w, b)$ according to the rule

$$\theta \leftarrow \theta - \eta \, \nabla_\theta L(\theta)$$

until convergence, where $\eta > 0$ is a predefined learning rate that determines the convergence rate.
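As a concrete illustration, the following is a minimal NumPy sketch of logistic regression fit by gradient descent on the cross-entropy loss; the function and variable names are our own and not taken from the report.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_regression(X, y, eta=0.1, n_steps=1000):
    """Fit (w, b) by gradient descent on the mean cross-entropy loss.

    X: (n_samples, n_features) feature matrix; y: (n_samples,) array of 0/1 labels.
    """
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(n_steps):
        p = sigmoid(X @ w + b)        # predicted probabilities
        grad_w = X.T @ (p - y) / n    # gradient of the loss w.r.t. w
        grad_b = np.mean(p - y)       # gradient of the loss w.r.t. b
        w -= eta * grad_w             # gradient descent updates
        b -= eta * grad_b
    return w, b
```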
An extension of logistic regression is the neural network, which is the workhorse class of models for modern deep learning methods. An $L$-layered neural network is a composition of non-linear, parametric functions $g_1, \dots, g_L$,

$$\hat{f}(x) = g_L \circ g_{L-1} \circ \cdots \circ g_1(x).$$

Note that logistic regression corresponds to the single-layer case $L = 1$ with a sigmoid activation. The exact structure of each layer function depends on the type of layers desired. For example, a convolutional neural network (CNN) contains convolutional layers of the form $g_\ell(x) = \sigma(W_\ell * x + b_\ell)$, where $*$ denotes convolution, usually combined with fully-connected layers of the form $g_\ell(x) = \sigma(W_\ell x + b_\ell)$, for some increasing activation function $\sigma$.
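For concreteness, a minimal Keras sketch of a CNN that composes convolutional and fully-connected layers might look like the following; the layer sizes, input shape, and class count are illustrative placeholders, not the architectures used later in this report.

```python
from tensorflow.keras import layers, models

# A small CNN: convolutional layers followed by fully-connected (Dense) layers.
# The 32x32x3 input shape and 10 output classes are placeholders.
model = models.Sequential([
    layers.Conv2D(32, (3, 3), activation="relu", input_shape=(32, 32, 3)),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),    # fully-connected layer
    layers.Dense(10, activation="softmax"),  # class probabilities
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```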
In adversarial machine learning, one attempts to subvert the supervised learning process by crafting adversarial examples, i.e. feature vectors $\tilde{x}$ such that $\hat{f}(\tilde{x}) \neq f(\tilde{x})$, even approximately. The simplest and most common way to craft such examples is the fast gradient sign method (FGSM), which uses the loss gradient in a slightly different way. One takes a clean example $x$ and perturbs it by an amount

$$\delta = \epsilon \, \mathrm{sign}\big(\nabla_x L(\theta, x, y)\big),$$

with predefined perturbation parameter $\epsilon > 0$, to get $\tilde{x} = x + \delta$. Further improvements can be made by then attempting to maximize the loss via constant, $\epsilon$-sized gradient ascent steps on the inputs,

$$\tilde{x}^{(t+1)} = \tilde{x}^{(t)} + \epsilon \, \mathrm{sign}\big(\nabla_x L(\theta, \tilde{x}^{(t)}, y)\big), \qquad \tilde{x}^{(0)} = x.$$
The perturbation can be thought of as an additive noise term, and is usually made small enough that the adversarial example resembles the original sample as closely as possible.
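A minimal TensorFlow sketch of the one-step FGSM perturbation is shown below, assuming a Keras classifier with softmax outputs, integer labels, and pixel values in $[0, 1]$; the helper name is our own.

```python
import tensorflow as tf

def fgsm_perturb(model, x, y, eps=0.1):
    """One-step FGSM: move x by eps in the direction of the sign of the loss gradient."""
    x = tf.convert_to_tensor(x, dtype=tf.float32)
    loss_fn = tf.keras.losses.SparseCategoricalCrossentropy()
    with tf.GradientTape() as tape:
        tape.watch(x)                   # track gradients w.r.t. the input, not the weights
        loss = loss_fn(y, model(x))
    grad = tape.gradient(loss, x)
    x_adv = x + eps * tf.sign(grad)     # additive, eps-bounded perturbation
    return tf.clip_by_value(x_adv, 0.0, 1.0)  # keep pixels in a valid range
```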
The approach described above implicitly assumes that the attacker has knowledge of the underlying model, since FGSM depends on the loss, which itself depends on the model output. However, it has been shown empirically that adversarial examples are often transferable [3]. That is, adversarial examples crafted against one model can often be transferred to a different model and still act as adversarial examples on that model.
While adversarial machine learning has been studied in some form for the past couple of decades, its modern incarnation is based on the work of Goodfellow, Shlens, and Szegedy [1]. It was this work that called attention to the fact that images can be altered imperceptibly to produce misclassifications with high confidence. The same work introduced the fast gradient sign method and showed that even the simplest machine learning models, not just deep learning methods, are vulnerable to adversarial attacks.
The vulnerability of machine learning models to black-box attacks was highlighted in [4] and [3]. The work in [4] motivated this vulnerability by reasoning that different models often present very similar attack surfaces. The work in [3] demonstrated the viability of black-box attacks by attacking an image classifier independently trained and deployed on a remote server.
Frameworks for adversarial computer vision are fairly new. Perhaps the first was the CleverHans library [5], a Python library compatible with TensorFlow that is used to benchmark the vulnerability of machine learning systems to adversarial examples. Another is Foolbox [6], a multi-framework Python toolbox for creating adversarial examples that fool neural networks. A third is the Adversarial Robustness Toolbox (ART) [7], a multi-framework Python library for rapidly crafting and analyzing attacks and defenses for machine learning models.
In our experiments, we employ a simplified black-box methodology to generate adversarial examples and attack classifiers on two popular image classification datasets, GTSRB and CIFAR-10. In each case, we employ the following methodology:

1. Split the dataset into a black-box portion and a substitute portion.
2. Train a black-box classifier on the black-box portion.
3. Independently train a substitute classifier on the substitute portion.
4. Use FGSM to craft adversarial examples against the substitute model.
5. Evaluate the classification accuracy of the black-box model (and, where applicable, the substitute model) on the adversarial examples.
The first vision model we choose to attack is a traffic sign classifier. The classifier takes as input an image containing a traffic sign and outputs a label corresponding to the type of traffic sign. The dataset used is the German Traffic Sign Recognition Benchmark (GTSRB) dataset from [8], which contains 51,837 images spanning 43 different classes of German traffic signs. Some examples of these images and their labels are shown in Figure 3.
For conducting the black-box attack, we use 39,208 of the images for the black-box dataset and the remaining 12,629 for the substitute dataset. Each image is first lightly processed by performing histogram equalization and center cropping, and is then resized to a standard size.
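A possible preprocessing pipeline along these lines is sketched below using OpenCV; the crop fraction and target size are illustrative assumptions, since the report does not specify them.

```python
import cv2

def preprocess(img, size=48, crop_frac=0.9):
    """Histogram-equalize, center-crop, and resize a BGR traffic-sign image.

    The 48x48 target size and 0.9 crop fraction are illustrative choices.
    """
    # Equalize the luminance channel to reduce lighting variation.
    ycrcb = cv2.cvtColor(img, cv2.COLOR_BGR2YCrCb)
    ycrcb[:, :, 0] = cv2.equalizeHist(ycrcb[:, :, 0])
    img = cv2.cvtColor(ycrcb, cv2.COLOR_YCrCb2BGR)

    # Center crop.
    h, w = img.shape[:2]
    ch, cw = int(h * crop_frac), int(w * crop_frac)
    top, left = (h - ch) // 2, (w - cw) // 2
    img = img[top:top + ch, left:left + cw]

    # Resize to a standard size.
    return cv2.resize(img, (size, size))
```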
For the black-box model, we chose to fine-tune a pretrained VGG-16 model [9]. This technique is an instance of transfer learning and has been shown to work very well for image classification. VGG-16 is a CNN consisting of 16 weighted layers; its architecture is shown in Figure 4. Using fine-tuning, we train the black-box model in Keras to a 96% test-set accuracy. For the substitute model, we train a custom CNN with 8 weighted layers in Keras to a 97% test-set accuracy.
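A fine-tuning setup along these lines might look like the following Keras sketch; the input size, number of frozen layers, and classification head are illustrative assumptions rather than the exact configuration used here.

```python
from tensorflow.keras import layers, models, optimizers
from tensorflow.keras.applications import VGG16

# Load VGG-16 with ImageNet weights, drop its classification head, and attach
# a new head for the 43 GTSRB classes. The 64x64 input size and the number of
# frozen layers are placeholders.
base = VGG16(weights="imagenet", include_top=False, input_shape=(64, 64, 3))
for layer in base.layers[:-4]:
    layer.trainable = False  # freeze all but the last convolutional block

model = models.Sequential([
    base,
    layers.Flatten(),
    layers.Dense(256, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(43, activation="softmax"),
])
model.compile(optimizer=optimizers.Adam(1e-4),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(x_train, y_train, validation_data=(x_val, y_val), epochs=10)
```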
Next, we use the ART implementation of FGSM to create 100 adversarial examples against the substitute CNN model. The prediction accuracy of both models on the original examples is shown in Table 1, and on the adversarial examples in Table 2. While the adversarial examples are far more effective against the substitute model (they were created using that model, after all), they still cause a substantial number of misclassifications on the black-box model. An example of the black-box prediction for one of these examples is shown in Figure 5. While the black-box model predicted the clean image to be a 100 km/h speed limit sign with perfect confidence, it predicted the adversarial image to be a 120 km/h speed limit sign with 98% confidence, despite the fact that the sign is still obviously a 100 km/h sign to a human.
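The attack step can be sketched with ART as follows; the model and data variable names and the choice of $\epsilon$ are placeholders, and the import paths follow recent ART releases rather than the v0.3.0 API cited in [7].

```python
import numpy as np
from art.attacks.evasion import FastGradientMethod
from art.estimators.classification import KerasClassifier

# Wrap the trained substitute Keras model and craft FGSM examples against it.
classifier = KerasClassifier(model=substitute_model, clip_values=(0.0, 1.0))
attack = FastGradientMethod(estimator=classifier, eps=0.1)  # eps is a placeholder
x_adv = attack.generate(x=x_test[:100])

# Measure transferability: accuracy of the independently trained black-box model.
preds = np.argmax(blackbox_model.predict(x_adv), axis=1)
print("black-box accuracy on adversarial examples:",
      np.mean(preds == y_test[:100]))
```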
Table 1: Traffic sign classifier results with original examples.
| Model | Accuracy (%) |
|---|---|
| Substitute | 100 |
| Black-Box | 85 |
Table 2: Traffic sign classifier results with adversarial examples.
| Model | Accuracy (%) |
|---|---|
| Substitute | 13 |
| Black-Box | 57 |
The second vision model we chose to attack is a classifier trained on the CIFAR-10 dataset from [10]. CIFAR-10 is organized by the Canadian Institute for Advanced Research and is comprised of 60,000 low-resolution (32×32 pixel) images: 50,000 training images and 10,000 test images. The dataset is divided evenly among ten classes and serves as a performance benchmark alongside larger, more diverse classification datasets such as CIFAR-100 and ImageNet. Some examples of these images and their labels can be seen in Figure 6.
For training the black-box model, we use all of the 50,000 training images in the CIFAR-10 dataset. The substitute model is trained using 7,000 of the 10,000 test images from the CIFAR-10 dataset. The remaining 3,000 are used for testing and validation of both models.
For the black-box model, we chose to use a wide ResNet (WRN) [11]. WRN is an adaptation of residual networks (ResNets) built on residual learning blocks; its architecture can be seen in Figure 7. WRNs have demonstrated exceptional classification accuracy, even compared to extremely deep ResNets, while being more memory efficient, faster to train, and less susceptible to diminishing feature reuse; for these reasons we chose the WRN architecture over a very deep ResNet. Using fine-tuning, we train the WRN black-box model to an 83% test-set accuracy. The substitute model for this experiment is an untrained version of the VGG-16 architecture used above, trained from scratch on the substitute dataset.
Next, we use the cleverhans implementation of FGSM to create adversarial examples against the substitute VGG-16 model. The prediction accuracy of the black-box model on the original and adversarial examples is shown in Table 3. Again, we see that the black-box classifier, despite having a different architecture and being trained on a different dataset than in the previous experiment, is susceptible to adversarial examples crafted using a substitute network.
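A sketch of this step using the TF2 API of recent cleverhans releases is shown below; the variable names and $\epsilon$ value are placeholders rather than the exact settings used in the experiment.

```python
import numpy as np
import tensorflow as tf
from cleverhans.tf2.attacks.fast_gradient_method import fast_gradient_method

# Craft FGSM examples against the substitute VGG-16 and check how well they
# transfer to the black-box WRN. eps is a placeholder value.
x = tf.convert_to_tensor(x_test, dtype=tf.float32)
x_adv = fast_gradient_method(substitute_model, x, eps=0.03, norm=np.inf)

preds = np.argmax(blackbox_model.predict(x_adv.numpy()), axis=1)
print("black-box accuracy on adversarial examples:", np.mean(preds == y_test))
```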
Table 3: Black-Box CIFAR-10 classifier results with original vs. adversarial examples.
| Examples | Accuracy (%) |
|---|---|
| Original | 83 |
| Adversarial | 12 |
In this report, we have shown that black-box machine learning classification models in computer vision are vulnerable to adversarial attacks. By crafting adversarial examples on a substitute model that may bear little relation to the black-box model, an attacker can subvert the intent of the black-box model and use its misclassifications to achieve some desired behavior. We must point out, however, that adversarial computer vision is not just a deep learning problem; in fact, [1] argues that virtually all machine learning models are similarly vulnerable.
A natural question to ask at this point is: if black-box models are so prone to adversarial attacks, what can be done to protect them? One straightforward approach is to include adversarial examples during training to inoculate the model against such attacks. While this approach may help against certain adversarial example crafting methods, it will not work for all possible adversarial examples. Other defense strategies exist as well, but this is still very much an area of active research. Before we can better understand defense strategies, we must better understand why adversarial examples arise so easily in the first place.
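A minimal sketch of this idea, adversarial training, is shown below; it reuses the fgsm_perturb helper sketched earlier and assumes a Keras model compiled with a sparse categorical cross-entropy loss. The mixing ratio, batch size, and perturbation size are illustrative choices.

```python
import numpy as np

def adversarial_training(model, x_train, y_train, epochs=5, batch_size=64, eps=0.1):
    """Augment each training batch with FGSM examples crafted on the current model."""
    n = len(x_train)
    for _ in range(epochs):
        idx = np.random.permutation(n)
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]
            x_clean, y_batch = x_train[batch], y_train[batch]
            x_adv = fgsm_perturb(model, x_clean, y_batch, eps=eps)  # see earlier sketch
            # Train on a 50/50 mix of clean and adversarial inputs.
            x_mix = np.concatenate([x_clean, x_adv.numpy()])
            y_mix = np.concatenate([y_batch, y_batch])
            model.train_on_batch(x_mix, y_mix)
    return model
```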
Evaluating the efficacy of various defense strategies is a natural direction for future work. Another is the creation of more realistic attack scenarios. In this report, for simplicity, we trained both the black-box and substitute models ourselves and sampled the training data for both from the same original dataset. In a more realistic scenario, the attacker would likely have no knowledge of how the black-box model was trained or what dataset was used to train it. Studying attacks under this even more limited information would be insightful.
[1] I. J. Goodfellow, J. Shlens, and C. Szegedy, “Explaining and harnessing adversarial examples,” arXiv preprint arXiv:1412.6572, 2014.
[2] Y. S. Abu-Mostafa, M. Magdon-Ismail, and H.-T. Lin, Learning From Data. AMLBook, 2012.
[3] N. Papernot, P. McDaniel, I. Goodfellow, S. Jha, Z. B. Celik, and A. Swami, “Practical black-box attacks against machine learning,” in Proceedings of the 2017 ACM on Asia Conference on Computer and Communications Security, ASIA CCS ’17, (New York, NY, USA), pp. 506–519, ACM, 2017.
[4] C. Szegedy, G. Inc, W. Zaremba, I. Sutskever, G. Inc, J. Bruna, D. Erhan, G. Inc, I. Goodfellow, and R. Fergus, “Intriguing properties of neural networks,” in ICLR, 2014.
[5] N. Papernot, F. Faghri, N. Carlini, I. Goodfellow, R. Feinman, A. Kurakin, C. Xie, Y. Sharma, T. Brown, A. Roy, A. Matyasko, V. Behzadan, K. Hambardzumyan, Z. Zhang, Y.-L. Juang, Z. Li, R. Sheatsley, A. Garg, J. Uesato, W. Gierke, Y. Dong, D. Berthelot, P. Hendricks, J. Rauber, and R. Long, “Technical report on the cleverhans v2.1.0 adversarial examples library,” arXiv preprint arXiv:1610.00768, 2018.
[6] J. Rauber, W. Brendel, and M. Bethge, “Foolbox: A python toolbox to benchmark the robustness of machine learning models,” arXiv preprint arXiv:1707.04131, 2017.
[7] M.-I. Nicolae, M. Sinn, M. N. Tran, A. Rawat, M. Wistuba, V. Zantedeschi, N. Baracaldo, B. Chen, H. Ludwig, I. Molloy, and B. Edwards, “Adversarial robustness toolbox v0.3.0,” CoRR, vol. 1807.01069, 2018.
[8] S. Houben, J. Stallkamp, J. Salmen, M. Schlipsing, and C. Igel, “Detection of traffic signs in real-world images: The German Traffic Sign Detection Benchmark,” in International Joint Conference on Neural Networks, no. 1288, 2013.
[9] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” CoRR, vol. abs/1409.1556, 2014.
[10] A. Krizhevsky, “Learning multiple layers of features from tiny images,” University of Toronto, 05 2012.
[11] S. Zagoruyko and N. Komodakis, “Wide residual networks,” CoRR, vol. abs/1605.07146, 2016.