Week 2
April 10 - April 14
Showcase
Identifying Text
A video of the code extracting textual data from a screenshot.
Identifying Icons
A video of the code identifying icons found in the image.
Identification and Mouse Movement
A video of the code moving the mouse to locations with both icons and texts and then stop motion reel of places the code identified as having text.
Table of Contents
Summary
Over the past week, I have made significant progress in developing KeyFlare, a program that identifies and classifies items on the screen. I have implemented multiple methods to process images, extract textual information, and interactively allow users to choose elements, via libraries such as "pyautogui," "pytesseract," "PIL," "cv2," and "re." The code employs a class-based architecture with a primary class called "Identifier" that performs various operations to identify and classify items on the screen, adhering to the Single Responsibility Principle.
I have written several methods for the Identifier class, including "processingImage," "boxes," "viewBoxes," "data," "viewData," "processingData," "identifyingLocations," and "chooseManually." Key methods within the class handle image preprocessing, text data extraction, bounding box processing, and user interaction, making use of the aforementioned libraries. The "main" function serves as the entry point of the application, creating an instance of the Identifier class and invoking its methods to achieve the desired functionality.
Moreover, I have ensured that the code is well-documented and structured, with clear docstrings and paragraphs explaining the purpose and functionality of each method. This makes my code easier to understand, maintain, and extend in the future. Overall, I have successfully developed a program that can identify, classify, and interact with items on the screen, marking a productive week of progress on this project.
Discussion
The following is a detailed rundown on my code.
#!/usr/bin/env poetry run python3
from datetime import datetime
import system
import numpy as np
import pytesseract
import re
import pyautogui
import cv2
import copy
from PIL import Image
pytesseract.pytesseract.tesseract_cmd = '/usr/bin/tesseract'
The project relies on a set of dependencies managed by Poetry, a Python dependency manager. The dependencies include standard Python libraries, third-party packages, and a custom module. The standard libraries comprise "datetime," which is used for handling dates and times, and "re," a library for working with regular expressions. The third-party packages include "numpy," a popular numerical computing library; "pytesseract," an Optical Character Recognition (OCR) tool that uses Google's Tesseract OCR engine; "pyautogui," a library for controlling the mouse and keyboard programmatically; "cv2," an alias for OpenCV (Open Source Computer Vision Library), a widely-used library for computer vision and machine learning applications; and "PIL" from the "Pillow" library, a fork of Python Imaging Library (PIL) that adds image opening capabilities to the project.
class Identifier:
   """This class will identify and classify items on the screen."""
   imagePath = None
   x = system.System(False)
   def __init__(self, x=None):
       if x != None:
           self.x = x
       img = self.x.image()
       img[1] = self.processingImage(img)
       chunks = self.data(img)
       processedChunks = self.processingData(chunks)
       icons = processedChunks[0]
       text = processedChunks[1]
       options = self.identifyingLocations(text)
       self.chooseManually(options)
       # x.openFiles(path=self.imagePath)
The "Identifier" class is designed to identify and select items displayed on the screen. It has an "imagePath" attribute initialized to "None" and a "system.System(False)" object assigned to the attribute "x." The class features a constructor method, which conditionally initializes the object with an optional x parameter. The initializer takes a screenshot, processes the image, extracts and processes chunks of data, and then separates the chunks into icons and text. The following descriptions will cover the image processing, data extraction, data processing, and data selection parts of the project as part of the image pipeline. This pipeline is close to the finished pipeline for actualizing the purpose of this project.
   def processingImage(self, image):
       """The processingImage(self, image) method takes a single argument, image, which is expected to be a list with the image file path as its first element."""
       cvImage = Image.open(image[0]).convert('RGB')
       cvImage = np.array(cvImage)
       cvImage = cvImage[:, :, ::-1].copy()
       return cvImage
The "processingImage(self, image)" method takes a single argument, "image," which is expected to be a list with the image file path as its first element and pillow image as the second element. The method opens the image using the Pillow library, converts it to the RGB color space, and then converts it into a NumPy array. Finally, it reverses the order of the color channels to transform the image from the RGB (Red Green Blue) format used by Pillow to the BGR (Blue Green Red) format used by OpenCV. Afterwards, the method returns the image in OpenCV's expected format.
The code for extracting and viewing bounding boxes around individual characters was omitted because it was not used in the end.
   def data(self, image):
       """
       The data follows the format ['level', 'page_num', 'block_num', 'par_num', 'line_num', 'word_num', 'left', 'top', 'width', 'height', 'conf', 'text']
       """
       tesstr = pytesseract.image_to_data(
           cv2.cvtColor(np.array(image[1]), cv2.COLOR_BGR2GRAY),
           lang='eng')
       entireList = []
       for thing in re.split("\n", tesstr):
           subList = []
           for item in re.split("\t", thing):
               try:
                   item = int(item)
                   subList.append(item)
               except ValueError:
                   subList.append(item)
           if len(subList) >= 12:
               entireList.append(subList)
       entireList = entireList[1:]
       return entireList
The "data(self, image)" method takes an "image" argument in OpenCV format, converts it to grayscale, and uses the "pytesseract.image_to_data()" function to extract regions of interest and other, unused properties from the image. The extracted data is reformatted into a list of sublists, with each sublist following the format [ 0 'level', 1 'page_num', 2 'block_num', 3 'par_num', 4 'line_num', 5 'word_num', 6 'left', 7 'top', 8 'width', 9 'height', 10 'conf', 11 'text']. The method returns the list of sublists, excluding the header row, providing a structured representation of the data.
   def viewData(self, chunks, img):
       """The "viewData(self, chunks, img)" method takes a list of "chunks," representing extracted text data and its properties, and an "img" argument in OpenCV format. The method iterates through the "chunks" and, for each element, prints relevant information such as level, word number, confidence, and text. It then draws a rectangle around the corresponding text element on the image using the "cv2.rectangle()" function. The image is resized and displayed using "cv2.imshow()" and "cv2.waitKey(0)."""
       for each in chunks:
           print("Level:", each[0], "word num:", each[5],
                 "conf:", each[10], "text:", each[11])
           h = 1080
           print(each)
           print((each[6], each[6]), (each[6] + each[8], each[7] + each[9]))
           cvImage = cv2.rectangle(
               img[1], (each[6], each[7]), (each[6] + each[8], each[7] + each[9]), (0, 255, 0), 2)
           cvImage = cv2.resize(cvImage, (0, 0), fx=0.75, fy=0.75)
           cv2.imshow("screenshot", cvImage)
           cv2.waitKey(0)
The "viewData(self, chunks, img)" method takes a list of "chunks," representing extracted text data and its properties, and an "img" argument in OpenCV format. The method iterates through the "chunks" and, for each element, prints relevant information, including level, word number, confidence, and text. It then draws a rectangle around the corresponding text element on the image using the "cv2.rectangle()" function. Since this bounding box is around the same variable, the images are retain the last bounding box. The image is resized and then displayed. This method visualizes the data, as shown in the Showcase.
   def processingData(self, chunks):
       """The "processingData(self, chunks)" method takes a list of "chunks," representing extracted text data and its properties, and filters the elements. The method returns a list containing both groups as sublists."""
       items = list()
       for each in chunks:
           if 20 < each[8]*each[9] < 20000 and each[8] < 500 and each[9] < 500:
               items.append(each)
       lowConfidenceItems = []
       highConfidenceItems = []
       for each in items:
           if each[10] < 1:
               lowConfidenceItems.append(each)
           elif each[10] >= 1:
               highConfidenceItems.append(each)
       return [lowConfidenceItems, highConfidenceItems]
The "processingData(self, chunks)" method takes a list of "chunks," representing extracted data (regions of interest) and their properties, and filters the elements based on specified size constraints for eacg bounding box. It filters the elements into two groups: "lowConfidenceItems," containing elements with a confidence score below 1 out of 100 (-1 is included to indicate elements that are completely non-textual), and "highConfidenceItems," containing elements with a confidence score of 1 (out of 100) or higher. The method returns a list with both groups as sublists. This method requires heavy fine-tuning in the future.
   def identifyingLocations(self, chunks):
       """The "identifyingLocations(self, chunks)" method takes a list of "chunks," representing extracted text data and its properties, and calculates the central location of each text element based on its bounding box coordinates and dimensions."""
       items = []
       for each in chunks:
           location = (each[6] + each[8]/2, each[7] + each[9]/2)
           text = each[11]
           items.append([text, location])
       return items
The "identifyingLocations(self, chunks)" method takes a list of "chunks" of data, representing extracted data (regions of interest) and its properties, and calculates the central location of each text element based on its (left, top) coordinates and dimensions. For each element, the method creates a list of the text and the calculated central location. The method returns a list of these tuples.
   def chooseManually(self, locations):
       """The "chooseManually(self, locations)" method takes a list of "locations," which contains sublists of text data and their corresponding central locations in the input image."""
       toPrint = "Choose where to move the mouse from the list below\n"
       for i, item in enumerate(locations):
           print(item)
           toPrint = toPrint + str(i) + ". " + str(item[0]) + "\n"
       print(toPrint)
       userChoice = input("Which number do you choose? ")
       try:
           userChoice = int(userChoice)
       except ValueError:
           print("You did not input a number.")
       pyautogui.moveTo(locations[userChoice][1][0],
                        locations[userChoice][1][1])
The "chooseManually(self, locations)" method takes a list of these "locations," to be called coordinates in the future, which contains sublists of text data and their corresponding central locations in the input image. By prompting the user to select a text element from the list in the terminal, the program automatically uses "pyautogui.moveTo()" to move the mouse cursor to the central location of that element. This method provides a limited interactive way for users to select elements.
def main():
   Identifier()
if __name__ == "__main__":
   main()
The "if __name__ == '__main__':" conditional statement ensures that the "main()" function is called only when the script is executed directly from the commandline.
Under the system module:
   def image(self, show=False, portion=None):
       """Takes an image and returns it as a pillow image"""
       if show == False:
           imagePath = self.series("screenshots", new=True)
           # To use pyautogui on linux, run ```sudo apt-get install scrot```
           myScreenshot = pyautogui.screenshot()
           print(imagePath)
           myScreenshot.save(imagePath)
           return [imagePath, Image.open(imagePath)]
       if show != False:
           pass
The "image(self, show=False, portion=None)" method captures a screenshot using the "pyautogui.screenshot()" function and saves the captured image to a file using the "series" function, which creates a file with a unique number-based series filename within a specified directory. The method returns a list containing the image path and the screenshot as a Pillow Image object. The series function will be made available in future weeks.
Future Tasks
Classifying Icons: Use OpenAI's ImageGPT or another alternative to describe a large number of icons, and then form human-generated categories to train a classifier.
https://github.com/datadrivendesign/semantic-icon-classifier
https://openai.com/research/image-gpt
Improve text recognition accuracy: Enhance the text recognition process by incorporating newer, more advanced OCR libraries or applying home-generated machine learning models to improve the accuracy of extracted text data by optimizing for screenshots.
TensorFlow
OpenCV
Google Cloud Vision API
Amazon Rekognition
Implement a graphical user interface (GUI): Develop a user-friendly GUI to make the application more accessible and easier to use for all users. Also, add shortcuts.
QT with Python
EasyGUI
TKinter
Support multiple languages: Extend the application's text recognition capabilities to handle multiple languages. Right now it can only handle english.
Chinese
Hindi
Gujarati
etc
Real-time processing: Modify the application to work in real-time, allowing users to interact with dynamic content.
Creating a while loop that allows for shortcuts, image change detection, and AI interfacing.