Sign language is one of the oldest and most natural forms of language for communication, but since most people do not know sign language and interpreters are very difficult to come by, I have developed a real-time method using neural networks for fingerspelling-based American Sign Language.
In this method, the hand is first passed through a filter, and the filtered image is then passed through a classifier which predicts the class of the hand gesture. This method achieves 90.00% accuracy for the 26 letters of the alphabet.
American Sign Language is a predominant sign language, since the only disability Deaf and Mute (hereby referred to as D&M) people have is communication-related: they cannot use spoken languages, so the only way for them to communicate is through sign language. Communication is the process of exchanging thoughts and messages in various ways such as speech, signals, behavior and visuals. D&M people use hand gestures to express their ideas to other people. Gestures are non-verbally exchanged messages, and these gestures are understood through vision. This non-verbal communication of D&M people is called sign language. A sign language is a language which uses gestures instead of sound to convey meaning, combining hand shapes, orientation and movement of the hands, arms or body, facial expressions and lip patterns. Contrary to popular belief, sign language is not international; it varies from region to region. Sign language is a visual language and consists of 3 major components:
1. Fingerspelling: spelling out words letter by letter.
2. Word-level sign vocabulary: signs used for most everyday communication.
3. Non-manual features: facial expressions and the position of the mouth, tongue and body.
In this project I focus on producing a model which can recognize fingerspelling-based hand gestures in order to form a complete word by combining each gesture.
The gestures I trained are as given in the image below.
Create a folder named Data; inside it there will be a sub-folder for each letter, named A to Z.
import os

# Specify dataset path dynamically
dataset_path = "Data"
class_name = input("Enter the sign label (e.g., A, B, C): ").upper()
folder = os.path.join(dataset_path, class_name)

# Create the folder if it doesn't exist
if not os.path.exists(folder):
    os.makedirs(folder)
Crop the detected hand region with some padding (offset):
imgCrop = img[max(0, y - offset):min(y + h + offset, img.shape[0]),
max(0, x - offset):min(x + w + offset, img.shape[1])]
Calculate the aspect ratio of the cropped image and fit the resized crop onto a square white canvas:
aspectRatio = h / w

# imgWhite is assumed to be a square white canvas created earlier in the loop,
# e.g. imgWhite = np.ones((imgSize, imgSize, 3), np.uint8) * 255

# If height is greater than width (portrait orientation)
if aspectRatio > 1:
    k = imgSize / h
    wCal = math.ceil(k * w)
    imgResize = cv2.resize(imgCrop, (wCal, imgSize))
    wGap = (imgSize - wCal) // 2
    # Centre the resized crop horizontally on the white canvas
    imgWhite[:, wGap:wGap + wCal] = imgResize
else:  # If width is greater than or equal to height (landscape orientation)
    k = imgSize / w
    hCal = math.ceil(k * h)
    imgResize = cv2.resize(imgCrop, (imgSize, hCal))
    hGap = (imgSize - hCal) // 2
    # Centre the resized crop vertically on the white canvas
    imgWhite[hGap:hGap + hCal, :] = imgResize
Display the original webcam image and capture keyboard input:
# Show the original webcam image
cv2.imshow("Image", img)

# Capture keyboard input
key = cv2.waitKey(1)

# If 's' is pressed, save the processed image
# (counter is assumed to be initialised to 0 and `time` imported before the capture loop)
if key == ord("s"):
    counter += 1
    cv2.imwrite(f'{folder}/Image_{time.time()}.jpg', imgWhite)
    print(counter)
# If 'q' is pressed, exit the loop
elif key == ord("q"):
    break
I captured each frame shown by the webcam of the machine.
In each frame I defined a region of interest (ROI), which is denoted by a blue bounded square as shown in the image below.
After capturing the image from the ROI, the captured image looks like the one shown below.
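A minimal sketch of this per-frame capture is shown below. It assumes the cvzone HandDetector supplies the bounding box (x, y, w, h) that is used as the ROI; the drawing colour and offset value are illustrative assumptions, not the project's exact settings.

import cv2
from cvzone.HandTrackingModule import HandDetector

cap = cv2.VideoCapture(0)
detector = HandDetector(maxHands=1)
offset = 20  # assumed padding around the detected hand

while True:
    success, img = cap.read()
    if not success:
        break
    hands, img = detector.findHands(img)  # detect a hand and draw its landmarks
    if hands:
        x, y, w, h = hands[0]['bbox']
        # Draw the bounded square that marks the region of interest
        cv2.rectangle(img, (x - offset, y - offset), (x + w + offset, y + h + offset), (255, 0, 0), 2)
    cv2.imshow("Image", img)
    if cv2.waitKey(1) == ord("q"):
        break

cap.release()
cv2.destroyAllWindows()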
4. After the creation of the training and testing data, the next step is to create a model for training. Here, I have used a Convolutional Neural Network (CNN) to build this model. The model is described below.
Unlike regular neural networks, the neurons in the layers of a CNN are arranged in 3 dimensions: width, height and depth.
The neurons in a layer are connected only to a small region (the window size) of the layer before it, instead of to all of the neurons in a fully-connected manner.
Moreover, the final output layer has dimensions 1x1x(number of classes), because by the end of the CNN architecture the full image is reduced to a single vector of class scores.
In the convolution layer I have taken a small window (typically 5x5) that extends through the full depth of the input matrix.
The layer consists of learnable filters of that window size. During every iteration the window is slid by the stride size (typically 1), and the dot product of the filter entries and the input values at the given position is computed.
As this process continues, we build a 2-dimensional activation map that gives the response of that filter at every spatial position.
That is, the network will learn filters that activate when they see some type of visual feature, such as an edge of some orientation or a blotch of some colour.
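As a rough illustration of this sliding-window dot product, the following NumPy sketch convolves a toy 5x5 input with a 3x3 filter at stride 1 (the values, filter size and stride are made up for brevity):

import numpy as np

image = np.arange(25, dtype=float).reshape(5, 5)    # toy 5x5 single-channel input
kernel = np.array([[1, 0, -1],
                   [1, 0, -1],
                   [1, 0, -1]], dtype=float)         # toy vertical-edge filter
stride = 1

out_size = (image.shape[0] - kernel.shape[0]) // stride + 1
activation = np.zeros((out_size, out_size))
for i in range(out_size):
    for j in range(out_size):
        window = image[i * stride:i * stride + 3, j * stride:j * stride + 3]
        activation[i, j] = np.sum(window * kernel)   # dot product of filter entries and input patch

print(activation)   # 3x3 activation map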
We use a pooling layer to decrease the size of the activation map and ultimately reduce the number of learnable parameters.
There are two types of pooling:
In max pooling we take a window (for example 2x2) and keep only the maximum of its 4 values.
We slide this window and continue this process, so we finally get an activation map half of its original size.
In average pooling we take the average of all the values in a window.
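As a small illustration with made-up values, applying a 2x2 pooling window to a 4x4 activation map gives a 2x2 result:

import numpy as np

activation = np.array([[1, 3, 2, 4],
                       [5, 6, 1, 2],
                       [7, 2, 9, 0],
                       [4, 8, 3, 5]], dtype=float)

# Group the 4x4 map into non-overlapping 2x2 windows, then reduce each window
max_pooled = activation.reshape(2, 2, 2, 2).max(axis=(1, 3))    # keep the maximum of each window
avg_pooled = activation.reshape(2, 2, 2, 2).mean(axis=(1, 3))   # keep the average of each window

print(max_pooled)   # [[6. 4.] [8. 9.]]
print(avg_pooled)   # [[3.75 2.25] [5.25 4.25]]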
In a convolution layer neurons are connected only to a local region, while in a fully connected layer we connect all the inputs to every neuron.
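A minimal Keras sketch of this kind of architecture is shown below; the number of filters, layer sizes and input shape are illustrative assumptions, not the exact summary of the trained model.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

num_classes = 26                      # one class per letter A to Z
model = Sequential([
    Conv2D(32, (5, 5), activation="relu", input_shape=(300, 300, 3)),  # learnable 5x5 filters (assumed input size)
    MaxPooling2D(pool_size=(2, 2)),                                    # halve the activation map
    Conv2D(64, (5, 5), activation="relu"),
    MaxPooling2D(pool_size=(2, 2)),
    Flatten(),                                                         # collapse the feature maps into a single vector
    Dense(128, activation="relu"),                                     # fully connected layer
    Dense(num_classes, activation="softmax"),                          # one score per class
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.summary()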
# Required imports for the prediction script
import cv2
from cvzone.HandTrackingModule import HandDetector
from cvzone.ClassificationModule import Classifier
from tensorflow.keras.models import load_model

# Load the model separately
model = load_model("Model/keras_model.h5", compile=False)

cap = cv2.VideoCapture(0)
detector = HandDetector(maxHands=1)
classifier = Classifier("Model/keras_model.h5", "Model/labels.txt")
offset = 20
imgSize = 300
labels = ["A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L", "M",
          "N", "O", "P", "Q", "R", "S", "T", "U", "V", "W", "X", "Y", "Z"]  # labels for A to Z
Prediction code that shows the results:
prediction, index = classifier.getPrediction(imgWhite, draw=False)
print(prediction, index)
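Here classifier.getPrediction returns the list of confidence scores for all classes along with the index of the highest-scoring class; that index is used to look up the corresponding letter in the labels list.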
The bounding box around the hand and the predicted label are updated in real-time as the user performs different hand gestures.
# Filled label background above the hand
cv2.rectangle(imgOutput, (x - offset, y - offset - 50), (x - offset + 100, y - offset), (255, 0, 255), cv2.FILLED)
# Predicted letter
cv2.putText(imgOutput, labels[index], (x, y - 26), cv2.FONT_HERSHEY_COMPLEX, 2, (255, 255, 255), 2)
# Bounding box around the hand
cv2.rectangle(imgOutput, (x - offset, y - offset), (x + w + offset, y + h + offset), (255, 0, 255), 4)
# cv2.imshow("ImageCrop",imgCrop)
# cv2.imshow("ImageWhite",imgWhite) # if you want img crop and img white window then un comment this
Note: Python 3.8 or above (but not the latest release) is required to build this project, as some of the required libraries cannot be installed on the latest versions of Python.
1. Latest pip -> pip install --upgrade pip
2. numpy -> pip install numpy
3. opencv -> pip install opencv-python
4. tensorflow -> pip install tensorflow==2.12.0
5. keras -> pip install keras==2.12.0
6. cvzone -> pip install cvzone
7. mediapipe -> pip install mediapipe==0.10.18
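For convenience, the same dependencies can be collected into a requirements.txt file matching the versions listed above and installed in one step with pip install -r requirements.txt:

numpy
opencv-python
tensorflow==2.12.0
keras==2.12.0
cvzone
mediapipe==0.10.18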
python /path/to/the/resultPredict.py