Recognizing Handwritten Digits

Jun 13, 2021

Here, to recognize handwritten digits, we will use Python as the programming language along with a few of its libraries.

Handwriting Recognition

Recognizing handwritten text is a problem that can be traced back to the first automatic machines that needed to recognize individual characters in handwritten documents. For example, consider the ZIP codes on letters at the post office and the automation needed to recognize these five digits. Another application is OCR (Optical Character Recognition) technology. OCR software reads handwritten text or pages of printed books and turns them into general electronic documents in which each character is well defined.

But the problem of handwriting recognition goes further back in time, more precisely to the early 20th century (1920s), when Emanuel Goldberg (1881–1970) began his studies on the issue and suggested that a statistical approach would be an optimal choice.

To explore this problem, the scikit-learn library for Python provides a good example for understanding the technique, the issues involved, and the possibility of making predictions.

Here we use the digits dataset, which is already provided in the sklearn.datasets module.

Steps

Importing the necessary libraries and loading the dataset.
Basic analysis of the digits dataset.
Splitting the data for testing and training.
Defining and fitting the SVM (support vector machine) and logistic regression models.
Getting prediction results from both models.
Finding the score of both models and making a confusion matrix of the logistic regression model for further analysis.

1. Importing the necessary libraries and loading the dataset.

Before starting, make sure numpy, pandas, matplotlib, and scikit-learn (sklearn) are installed on your computer, along with seaborn, which is used for the heatmap in step 6. If they are not installed, you can install them with pip from the terminal or command prompt.

We need to import all the modules required for training our models. The sklearn.datasets module already contains the digits dataset that we use for handwritten digit recognition, so we can import it directly through sklearn. The load_digits() function returns the data and the targets (labels), which we use to build our training and testing sets. The DESCR attribute of the returned object gives a brief description of the digits dataset.

Importing the necessary libraries and printing the description of the dataset.
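As a minimal sketch of this step (the variable name `digits` is our own choice here and may differ from the full code in the repository), loading the dataset and printing its description could look like this:

```python
from sklearn.datasets import load_digits

# load_digits() returns a Bunch object holding the data, targets and metadata.
digits = load_digits()

# DESCR is a plain-text description of the dataset.
print(digits.DESCR)

# The attributes available on the returned object.
print(digits.keys())
```

The returned object behaves like a dictionary, so `digits.data` and `digits["data"]` refer to the same array.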

2. Basic analysis of the digits dataset.

After loading the dataset, we look at what the data and the targets are: we find the size and shape of the data in the digits dataset and do the same for the targets. We then plot some of the digits using the subplot function from the matplotlib library.

Finding the size of the target and target_names attributes of the dataset.
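A short sketch of this inspection, assuming the dataset is loaded into a variable called `digits` (our naming), might be:

```python
from sklearn.datasets import load_digits

digits = load_digits()

# Each sample is an 8x8 image flattened into 64 features.
print(digits.data.shape)    # (1797, 64)
print(digits.images.shape)  # (1797, 8, 8)

# One label per sample, and the ten possible class names.
print(digits.target.shape)  # (1797,)
print(digits.target.size)   # 1797
print(digits.target_names)  # [0 1 2 3 4 5 6 7 8 9]
```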

Plotting some of the digit data using the subplot function of matplotlib.

Output of plotting the digit data with matplotlib.
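One possible way to draw the first few digits is sketched below. The 2x5 grid, the colour map and the figure size are our own assumptions, not necessarily what the original code used.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits

digits = load_digits()

# A 2x5 grid of subplots showing the first ten images with their labels.
fig, axes = plt.subplots(2, 5, figsize=(10, 4))
for ax, image, label in zip(axes.ravel(), digits.images, digits.target):
    ax.imshow(image, cmap="gray_r", interpolation="nearest")
    ax.set_title(f"Label: {label}")
    ax.axis("off")
fig.tight_layout()
```

When running this as a script, call `plt.show()` afterwards to display the figure.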

3. Splitting the data for testing and training

To split the data into training and testing sets, we use the train_test_split() function from the sklearn.model_selection module. The function takes the dataset and splits it into a training part and a testing part. By default it shuffles the data before splitting, so each run produces a different split. If we want the same split on every run, we pass the random_state argument, which fixes the seed used for shuffling. To vary the size of the testing set, we use the test_size parameter, which specifies how much of the data to keep for testing.

Splitting the digits dataset into training and testing datasets.
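The split can be sketched as follows. The values `test_size=0.25` and `random_state=42` are illustrative assumptions; the article does not state which values it used.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

digits = load_digits()

# Keep 25% of the samples for testing; random_state fixes the shuffle seed.
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=42
)
print(X_train.shape, X_test.shape)

# Repeating the call with the same random_state gives the same split.
X_train2, X_test2, y_train2, y_test2 = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=42
)
print(np.array_equal(X_test, X_test2))
```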

4. Defining and fitting the SVM (support vector machine) and logistic regression models.

Before defining the models, we first need to know what the support vector machine and logistic regression models are.

What is the Support Vector Machine (SVM) algorithm?

In brief, the objective of the support vector machine algorithm is to find a hyperplane in an N-dimensional space (where N is the number of features) that distinctly classifies the data points.

To separate the two classes of data points, there are many possible hyperplanes that could be chosen. Our objective is to find a plane that has the maximum margin, i.e., the maximum distance between data points of both classes. Maximizing the margin distance provides some reinforcement so that future data points can be classified with more confidence.
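The maximum-margin idea can be written down compactly. The formulation below is the standard hard-margin linear SVM for two classes, given here as a sketch; the symbols (weights w, bias b, labels y_i in {-1, +1}) are our own notation and do not come from the original article.

```latex
\begin{aligned}
&\text{Hyperplane:} && \mathbf{w}^\top \mathbf{x} + b = 0 \\
&\text{Margin width:} && \frac{2}{\lVert \mathbf{w} \rVert} \\
&\text{Objective:} && \min_{\mathbf{w},\, b} \ \tfrac{1}{2} \lVert \mathbf{w} \rVert^2
  \quad \text{subject to} \quad y_i \left( \mathbf{w}^\top \mathbf{x}_i + b \right) \ge 1 \ \text{for all } i
\end{aligned}
```

Minimizing the norm of w is equivalent to maximizing the margin width, which is why the optimization is stated this way.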

Hyperplanes and Support Vectors

Hyperplanes are the decision boundaries that help to classify the data points. Data points falling on either side of the hyperplane can be attributed to different classes. Also, the dimension of the hyperplane depends upon the number of features. If the number of input features is 2, then the hyperplane is just a line. If the number of input features is 3, then the hyperplane becomes a two-dimensional plane. It becomes difficult to imagine when the number of features exceeds 3.

The support vector machine (SVM) implementation is already available in the sklearn library. To use the algorithm, we import it from sklearn and then fit the SVC model to the training data.
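A minimal sketch of defining and fitting the model is shown below. It uses `SVC()` with its default hyperparameters, which is an assumption on our part; the split parameters are likewise illustrative.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=42
)

# Support vector classifier with default settings (assumed, not from the article).
svc = SVC()
svc.fit(X_train, y_train)

# The classes the fitted model has learned to distinguish.
print(svc.classes_)
```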

What is Logistic Regression?

Logistic regression is the regression analysis to conduct when the dependent variable is dichotomous (binary). It is used to describe data and to explain the relationship between one dependent binary variable and one or more nominal, ordinal, interval or ratio-level independent variables. scikit-learn's LogisticRegression also supports multiclass targets, which is what we rely on for the ten digit classes.

This is the equation of logistic regression:
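The original figure is not reproduced here. As a sketch, the standard form of the binary logistic regression model is given below; the symbol names (coefficients beta, features x_1 to x_n) are our own notation.

```latex
p(y = 1 \mid \mathbf{x}) = \sigma(z) = \frac{1}{1 + e^{-z}},
\qquad z = \beta_0 + \beta_1 x_1 + \dots + \beta_n x_n
```

The sigmoid function maps the linear combination z to a value between 0 and 1, which is read as the probability of the positive class.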

The logistic regression model is already available in sklearn. To use it, we import the LogisticRegression class from sklearn.linear_model and fit it to the training data.
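Fitting the model could look roughly like this. The `max_iter=10000` setting is our own assumption: it only raises the iteration cap so the solver has room to converge on the unscaled pixel values.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=42
)

# A higher iteration cap than the default (assumed value, see the note above).
log_reg = LogisticRegression(max_iter=10000)
log_reg.fit(X_train, y_train)

# One row of coefficients per digit class, one column per pixel.
print(log_reg.coef_.shape)
```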

5. Getting prediction results from both models.

To get predictions from the SVC model, we pass the test dataset to its predict() method.

Predicting the test dataset using the SVC model.
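A sketch of the prediction step, again assuming a default `SVC()` and an illustrative split:

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=42
)

svc = SVC().fit(X_train, y_train)

# predict() returns one predicted label per test sample.
predictions = svc.predict(X_test)

# Compare the first few predictions with the true labels.
print(predictions[:10])
print(y_test[:10])
```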

Next, we determine what percentage of the test data was predicted correctly.

Gives the percentage of the test data that we predicted correctly.

We then pass the test dataset to the logistic regression model to get its predictions.
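One simple way to compute this percentage is to compare predictions with the true labels element by element. This is a sketch under the same assumptions as before (default `SVC()`, illustrative split) and may differ from the original code.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=42
)

svc = SVC().fit(X_train, y_train)
predictions = svc.predict(X_test)

# Fraction of matching labels, expressed as a percentage.
accuracy_percent = np.mean(predictions == y_test) * 100
print(f"SVC accuracy: {accuracy_percent:.2f}%")
```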

6. Finding the score of both models and making a confusion matrix of the logistic regression model for further analysis.

We already found the score (accuracy) of the SVC model in the previous step. In this step we find the score of our logistic regression model and build a confusion matrix for it to understand the data better.

The <model_name>.score() method returns the score (accuracy) of the model. It tells us what fraction of the test dataset was predicted correctly.

After finding the score of the logistic regression model, we build a confusion matrix and draw it as a heatmap using the seaborn.heatmap() function from the seaborn library. This helps us understand the model as well as the data.
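The sketch below shows score() on a logistic regression model and checks that it matches the accuracy computed from the predictions. The `max_iter` value and split parameters are our own assumptions.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=42
)

log_reg = LogisticRegression(max_iter=10000).fit(X_train, y_train)

# score() predicts on X_test internally and returns the mean accuracy.
score = log_reg.score(X_test, y_test)

# The same number, computed explicitly from the predictions.
manual = accuracy_score(y_test, log_reg.predict(X_test))
print(score, manual)
```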
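A possible version of the confusion matrix and heatmap is sketched here. The styling arguments (`annot`, `fmt`, `cmap`) and the model settings are our own choices, not necessarily those of the original code.

```python
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=42
)

log_reg = LogisticRegression(max_iter=10000).fit(X_train, y_train)
y_pred = log_reg.predict(X_test)

# Rows are true labels, columns are predicted labels.
cm = confusion_matrix(y_test, y_pred, labels=digits.target_names)

# Draw the matrix as an annotated heatmap.
fig, ax = plt.subplots(figsize=(7, 6))
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues", ax=ax)
ax.set_xlabel("Predicted label")
ax.set_ylabel("True label")
```

Off-diagonal cells show which digits the model confuses with each other. Call `plt.show()` to display the figure when running this as a script.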

Thank you

I am thankful to the mentors at https://internship.suvenconsultants.com for providing awesome problem statements and giving many of us a coding internship experience. Thank you, www.suvenconsultants.com.

I also thank the reader for reading my article. Check this GitHub link for the full code: https://github.com/codebyabhishek772/hand_writing_recognition