
Handwritten Digit Recognition Using Support Vector Machines

I have been sitting around on the MNIST data set for a while now. MNIST is a large database of handwritten digits, provided as training data in the Kaggle competition. In fact, I have been sitting on this data set for so long that the last thing I wrote for it dates back to last August: a Python script that takes the training data and creates a BMP image file for each data point. You end up with a folder of 42000 28 by 28 pixel images (about 74.5 MB). I have uploaded it for those interested.
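The conversion script itself isn't reproduced here, but a minimal sketch of the idea could look like this. The output folder digits/ and the inversion to dark-on-white are my assumptions, chosen to match the grey-value swap in the prediction script further down:

from PIL import Image
from numpy import genfromtxt
import numpy as np

# skip the header row, then take label + 784 pixel values per line
dataset = genfromtxt('train.csv', delimiter=",", dtype=np.dtype('>i4'))[1:]
for i, row in enumerate(dataset):
    # invert grey values so the BMPs show dark digits on a white background
    pixels = (255 - row[1:]).astype(np.uint8)
    img = Image.fromarray(pixels.reshape(28, 28), mode='L')
    img.save('digits/%05d.bmp' % i)  # assumes the digits/ folder exists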


Digits from MNIST data set

But what I did this weekend was use the LinearSVC classifier implemented in the scikit-learn module to create a simple model that determines the digit from the given pixel data, with an accuracy of 84% on the test data of the Kaggle competition. My implementation is based on an example of using an SVM to recognize handwritten digits.


What I will present here isn't the script I used for the Kaggle submission, but the one I used on the training data to measure the accuracy of the model. The advantage of using only the training data is that I have the correct label for every data point and can therefore display a confusion matrix and other metrics for evaluating the quality of the model.


Linear support vector machines try to find a hyperplane that separates the training data into two classes with a maximum margin. In our case the class of a data point is the digit it represents. We want to maximize the margin between the hyperplane and the two classes to minimize the risk of incorrectly recognizing a digit. The hyperplane then divides the data, so that everything above the hyperplane belongs to one class and everything below it belongs to the other. Each pixel value of the 28 by 28 image gets its own dimension, meaning that an image is a point in a space with 28 * 28 = 784 dimensions, and the hyperplane divides the data points into two classes in this 784-dimensional space.

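In formulas: the hyperplane is the set of points x with w · x + b = 0, a new image is classified by which side of it it falls on, and training maximizes the margin (which works out to 2/||w||) while penalizing points on the wrong side. A standard soft-margin hinge-loss sketch of this (LinearSVC actually defaults to the squared hinge loss, as the parameter dump further down shows):

    \hat{y} = \operatorname{sign}(\mathbf{w}^\top \mathbf{x} + b),
    \qquad
    \min_{\mathbf{w},\,b}\ \frac{1}{2}\lVert \mathbf{w} \rVert^2
    + C \sum_{i=1}^{n} \max\bigl(0,\ 1 - y_i(\mathbf{w}^\top \mathbf{x}_i + b)\bigr)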

An additional aspect to consider is that dividing images into digits between 0 and 9 is a multiclass classification problem. My description in the previous paragraph of how support vector machines work only contains one hyperplane, which can divide the data into only two classes. That truly is a problem when we have more than two classes, as in this case. The solution is to train multiple support vector machines, each solving a problem of the form: "Is this digit a 3 or not a 3?". Now we are again solving a binary classification with the two classes "is a 3" and "is not a 3". In our case we have one support vector machine per digit, giving us a total of ten, and we take the solution with the highest confidence score as the predicted digit.

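To make this concrete, here is a minimal sketch of what the one-vs-rest scheme boils down to. LinearSVC does this internally (its multi_class parameter defaults to 'ovr'), so the full script below never spells it out; the variable names match that script:

import numpy as np
from sklearn import svm

# one binary SVM per digit: "is a d" vs. "is not a d"
binary_svms = []
for d in range(10):
    clf = svm.LinearSVC()
    clf.fit(data_learn, [1 if label == d else 0 for label in labels_learn])
    binary_svms.append(clf)

# score every test sample against all ten SVMs and
# take the digit whose SVM is most confident
scores = np.column_stack([c.decision_function(data_test) for c in binary_svms])
predicted = np.argmax(scores, axis=1)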

So here is what "train.csv" looks like, to make sense of the indexing in the code:


label, pixel0, pixel1, pixel2, ..., pixel783
1, 0, 0, 0, ..., 0
4, 0, 0, 0, ..., 0
⋮

The pixel data can take values in the range [0, 255], where 255 is black and 0 is white.

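A quick way to confirm this encoding (a throwaway check of mine, not part of the pipeline):

import numpy as np
from numpy import genfromtxt

dataset = genfromtxt('train.csv', delimiter=",", dtype=np.dtype('>i4'))[1:]
pixels = dataset[:, 1:]            # drop the label column
print(pixels.min(), pixels.max())  # expected: 0 255 (white to black)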

And this is what my code looks like:


import csv
from sklearn import svm, metrics
from numpy import genfromtxt
import numpy as np

dataset = genfromtxt('train.csv', delimiter=",", dtype=np.dtype('>i4'))[1:]
labels = [x[0] for x in dataset]
data = [x[1:] for x in dataset]
n_samples = len(labels)
n_features = len(data[0])
print("Number of samples: " + str(n_samples) + ", number of features: " + str(n_features))

# a support vector classifier
classifier = svm.LinearSVC()

split_point = int(n_samples * 0.66)
# using two thirds for training
# and one third for testing
labels_learn = labels[:split_point]
data_learn = data[:split_point]
labels_test = labels[split_point:]
data_test = data[split_point:]
print("Training: " + str(len(labels_learn)) + " Test: " + str(len(labels_test)))

# Learning Phase
classifier.fit(data_learn, labels_learn)

# Predict Test Set
predicted = classifier.predict(data_test)

# classification report
print("Classification report for classifier %s:\n%s\n" % (classifier, metrics.classification_report(labels_test, predicted)))

# confusion matrix
print("Confusion matrix:\n%s" % metrics.confusion_matrix(labels_test, predicted))
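One step the listing above leaves out, but that the prediction script further down depends on, is saving the fitted classifier. A minimal sketch, assuming the same path the second script loads from:

from sklearn.externals import joblib

# persist the fitted model so it can be reused without retraining
joblib.dump(classifier, "../classifier/kaggle_digit_recognizer.pkl")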

The cool thing about metrics is that you can easily print output like this to judge how well your model is performing:


Number of samples: 42000, number of features: 784
Training: 27720 Test: 14280
Classification report for classifier LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='squared_hinge', max_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0):
             precision    recall  f1-score   support

          0       0.96      0.93      0.94      1442
          1       0.93      0.97      0.95      1613
          2       0.92      0.79      0.85      1376
          3       0.88      0.86      0.87      1468
          4       0.84      0.92      0.88      1339
          5       0.67      0.84      0.75      1296
          6       0.84      0.96      0.90      1388
          7       0.94      0.85      0.89      1504
          8       0.81      0.73      0.77      1401
          9       0.87      0.78      0.82      1453

avg / total       0.87      0.86      0.86     14280

Confusion matrix:
[[1334    0    8    9    2   36   44    1    8    0]
 [   0 1570    5    4    2    9    2    2   18    1]
 [   8   25 1087   33   14   58   77   17   52    5]
 [   4   14   27 1260    2   89   20   11   32    9]
 [   4   12    9    2 1230   13   23    2   21   23]
 [  13    6    5   50   23 1083   48    1   54   13]
 [   7    5    7    0    5   23 1336    0    4    1]
 [   6    6   11   10   33   34    2 1274   23  105]
 [   5   44   12   30   16  213   30    6 1027   18]
 [   7   10    8   32  138   52    2   41   32 1131]]

The most interesting analysis metric for our digit recognition is probably the confusion matrix. Each row represents one digit, and each column also represents one digit. An entry in the matrix counts the number of times the digit of that row was recognized as the digit of that column. So the very first entry, 1334, says that 1334 times a digit 0 was recognized as a digit 0. The number 8 two columns to the right means that 8 times a 0 in the training data was recognized as a 2 by our SVM, and so on and so forth. Naturally, for a well-working prediction model the entries on the diagonal should be substantially larger than the other values in the given row, which is the case with this linear SVM.

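If you want per-digit numbers straight out of the matrix, the diagonal divided by the row sums gives the recall for each digit. A small sketch of mine, reusing labels_test and predicted from the script above:

import numpy as np
from sklearn import metrics

cm = metrics.confusion_matrix(labels_test, predicted)
# correct predictions for a digit divided by how often it truly occurs
recall_per_digit = np.diag(cm) / cm.sum(axis=1)
for digit, recall in enumerate(recall_per_digit):
    print("digit %d: recall %.2f" % (digit, recall))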

Applying the SVM to my own digits

Now that I have an accurately working model for the MNIST handwritten digits, I would also like to see how well the model works on my own digits. So I created a few BMP images using Paint.



My digits made in Paint


I extract the grey value of each pixel from each BMP image and feed this as test data to my SVM's predict. I use joblib to load my previously saved classifier, so I don't have to train the model from scratch every time. Here is the script I used:


from PIL import Image
import numpy as np
import sys
from sklearn.externals import joblib

# argv[1] - path to input image
if len(sys.argv) != 2:
    print("Incorrect number of arguments, add a BMP file as cmd line argument.\n")
    sys.exit()

# loading the grey values from the image
custom_IM = Image.open(sys.argv[1])
custom_pixels = list(custom_IM.getdata())
corr_pixels = []

# convert pixel data to fit training data format (swap grey values)
for row in custom_pixels:
    new_row = 255 - row[0]
    corr_pixels.append(new_row)

if len(corr_pixels) != 784:
    print("Incorrect Image Dimensions (needs to be 784)\n")
    sys.exit()

# convert to numpy array
test_set = np.array(corr_pixels)

classifier = joblib.load("../classifier/kaggle_digit_recognizer.pkl")

# Predict Test Set
predicted = classifier.predict(test_set)

# prints the predicted number
print(predicted)
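Called from the command line it looks like this (the script name is just my placeholder; the output is the raw array that classifier.predict returns, here for my zero image):

$ python predict_digit.py zero.bmp
[3]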

I have only created one image for each digit and here are my results:


label, prediction
0, 3
1, 1
2, 7
3, 3
4, 1
5, 8
6, 4
7, 2
8, 8
9, 8

Accuracy: 40%

So these aren't very good results at all for my self-made images. Since my model performed much better on the actual test set, I guess that the circumstances under which the digits are drawn matter a great deal. It could be that the brushes I used in Paint don't resemble the kind of writing in the original data set.


Overall, support vector machines are a powerful prediction method and a widely used machine learning algorithm. But you can also see how badly these simple models can perform on differently created images.


As always, please comment with corrections and suggestions on how to improve the code and, in this case, the prediction model as well.


