Week 2

April 10 - April 14

Showcase

Identifying Text

A video of the code extracting textual data from a screenshot.

Identifying Icons

A video of the code identifying icons found in the image.

Identification and Mouse Movement

A video of the code moving the mouse to locations containing both icons and text, followed by a stop-motion reel of the places the code identified as containing text.


Summary


Over the past week, I have made significant progress in developing KeyFlare, a program that identifies and classifies items on the screen. I have implemented multiple methods to process images, extract textual information, and let users interactively choose elements, using libraries such as "pyautogui," "pytesseract," "PIL," "cv2," and "re." The code employs a class-based architecture with a primary class called "Identifier" that performs various operations to identify and classify items on the screen, adhering to the Single Responsibility Principle.

I have written several methods for the Identifier class, including "processingImage," "boxes," "viewBoxes," "data," "viewData," "processingData," "identifyingLocations," and "chooseManually." Key methods within the class handle image preprocessing, text data extraction, bounding box processing, and user interaction, making use of the aforementioned libraries. The "main" function serves as the entry point of the application, creating an instance of the Identifier class and invoking its methods to achieve the desired functionality.

Moreover, I have ensured that the code is well-documented and structured, with clear docstrings and paragraphs explaining the purpose and functionality of each method. This makes my code easier to understand, maintain, and extend in the future. Overall, I have successfully developed a program that can identify, classify, and interact with items on the screen, marking a productive week of progress on this project.


Discussion


The following is a detailed rundown of my code.

#!/usr/bin/env poetry run python3
from datetime import datetime
import system
import numpy as np
import pytesseract
import re
import pyautogui
import cv2
import copy
from PIL import Image

pytesseract.pytesseract.tesseract_cmd = '/usr/bin/tesseract'


The project relies on a set of dependencies managed by Poetry, a Python dependency manager. The dependencies include standard Python libraries, third-party packages, and a custom module. The standard libraries comprise "datetime," used for handling dates and times; "re," for working with regular expressions; and "copy," for duplicating objects. The third-party packages include "numpy," a popular numerical computing library; "pytesseract," an Optical Character Recognition (OCR) tool that uses Google's Tesseract OCR engine; "pyautogui," a library for controlling the mouse and keyboard programmatically; "cv2," an alias for OpenCV (Open Source Computer Vision Library), a widely used library for computer vision and machine learning applications; and "PIL" from the "Pillow" library, a fork of the Python Imaging Library (PIL) that provides the image opening capabilities used in this project. The custom "system" module is covered at the end of this post.
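
Note that the "tesseract_cmd" path above is Linux-specific. As a quick sanity check, pytesseract can report the version of the binary it finds; a minimal sketch:

import pytesseract

pytesseract.pytesseract.tesseract_cmd = '/usr/bin/tesseract'
# Raises TesseractNotFoundError if the configured path is wrong
print(pytesseract.get_tesseract_version())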


class Identifier:
    """This class will identify and classify items on the screen."""

    imagePath = None
    x = system.System(False)

    def __init__(self, x=None):
        if x is not None:
            self.x = x
        img = self.x.image()
        img[1] = self.processingImage(img)
        chunks = self.data(img)
        processedChunks = self.processingData(chunks)
        icons = processedChunks[0]
        text = processedChunks[1]
        options = self.identifyingLocations(text)
        self.chooseManually(options)
        # x.openFiles(path=self.imagePath)


The "Identifier" class is designed to identify and select items displayed on the screen. It has an "imagePath" attribute initialized to "None" and a "system.System(False)" object assigned to the attribute "x." The class features a constructor method, which conditionally initializes the object with an optional x parameter. The initializer takes a screenshot, processes the image, extracts and processes chunks of data, and then separates the chunks into icons and text. The following descriptions will cover the image processing, data extraction, data processing, and data selection parts of the project as part of the image pipeline. This pipeline is close to the finished pipeline for actualizing the purpose of this project.


    def processingImage(self, image):
        """Takes a single argument, image, which is expected to be a list with the image file path as its first element."""
        cvImage = Image.open(image[0]).convert('RGB')
        cvImage = np.array(cvImage)
        cvImage = cvImage[:, :, ::-1].copy()
        return cvImage


The "processingImage(self, image)" method takes a single argument, "image," which is expected to be a list with the image file path as its first element and pillow image as the second element. The method opens the image using the Pillow library, converts it to the RGB color space, and then converts it into a NumPy array. Finally, it reverses the order of the color channels to transform the image from the RGB (Red Green Blue) format used by Pillow to the BGR (Blue Green Red) format used by OpenCV. Afterwards, the method returns the image in OpenCV's expected format.


The code for extracting and viewing bounding boxes around individual characters was omitted because it was not used in the end.


    def data(self, image):
        """
        The data follows the format ['level', 'page_num', 'block_num', 'par_num', 'line_num', 'word_num', 'left', 'top', 'width', 'height', 'conf', 'text']
        """
        tesstr = pytesseract.image_to_data(
            cv2.cvtColor(np.array(image[1]), cv2.COLOR_BGR2GRAY),
            lang='eng')
        entireList = []
        for thing in re.split("\n", tesstr):
            subList = []
            for item in re.split("\t", thing):
                try:
                    item = int(item)
                    subList.append(item)
                except ValueError:
                    subList.append(item)
            if len(subList) >= 12:
                entireList.append(subList)
        entireList = entireList[1:]
        return entireList


The "data(self, image)" method takes an "image" argument in OpenCV format, converts it to grayscale, and uses the "pytesseract.image_to_data()" function to extract regions of interest and other, unused properties from the image. The extracted data is reformatted into a list of sublists, with each sublist following the format [ 0 'level', 1 'page_num', 2 'block_num', 3 'par_num', 4 'line_num', 5 'word_num', 6 'left', 7 'top', 8 'width', 9 'height', 10 'conf', 11 'text']. The method returns the list of sublists, excluding the header row, providing a structured representation of the data.


    def viewData(self, chunks, img):
        """Draws a rectangle around each extracted text element and displays the annotated image."""
        for each in chunks:
            print("Level:", each[0], "word num:", each[5],
                  "conf:", each[10], "text:", each[11])
            print(each)
            print((each[6], each[7]), (each[6] + each[8], each[7] + each[9]))
            cvImage = cv2.rectangle(
                img[1], (each[6], each[7]), (each[6] + each[8], each[7] + each[9]), (0, 255, 0), 2)
            cvImage = cv2.resize(cvImage, (0, 0), fx=0.75, fy=0.75)
            cv2.imshow("screenshot", cvImage)
            cv2.waitKey(0)


The "viewData(self, chunks, img)" method takes a list of "chunks," representing extracted text data and its properties, and an "img" argument in OpenCV format. The method iterates through the "chunks" and, for each element, prints relevant information, including level, word number, confidence, and text. It then draws a rectangle around the corresponding text element on the image using the "cv2.rectangle()" function. Since this bounding box is around the same variable, the images are retain the last bounding box. The image is resized and then displayed. This method visualizes the data, as shown in the Showcase.


    def processingData(self, chunks):
        """Filters the elements of chunks by bounding-box size, then splits them into low- and high-confidence groups. Returns a list containing both groups as sublists."""
        items = list()
        for each in chunks:
            if 20 < each[8]*each[9] < 20000 and each[8] < 500 and each[9] < 500:
                items.append(each)
        lowConfidenceItems = []
        highConfidenceItems = []
        for each in items:
            if each[10] < 1:
                lowConfidenceItems.append(each)
            else:
                highConfidenceItems.append(each)
        return [lowConfidenceItems, highConfidenceItems]


The "processingData(self, chunks)" method takes a list of "chunks," representing extracted data (regions of interest) and their properties, and filters the elements based on specified size constraints for eacg bounding box. It filters the elements into two groups: "lowConfidenceItems," containing elements with a confidence score below 1 out of 100 (-1 is included to indicate elements that are completely non-textual), and "highConfidenceItems," containing elements with a confidence score of 1 (out of 100) or higher. The method returns a list with both groups as sublists. This method requires heavy fine-tuning in the future.


    def identifyingLocations(self, chunks):
        """Calculates the central location of each text element from its bounding box coordinates and dimensions."""
        items = []
        for each in chunks:
            location = (each[6] + each[8]/2, each[7] + each[9]/2)
            text = each[11]
            items.append([text, location])
        return items


The "identifyingLocations(self, chunks)" method takes a list of "chunks" of data, representing extracted data (regions of interest) and its properties, and calculates the central location of each text element based on its (left, top) coordinates and dimensions. For each element, the method creates a list of the text and the calculated central location. The method returns a list of these tuples.


    def chooseManually(self, locations):
        """Prompts the user to pick a text element from locations and moves the mouse to its central location."""
        toPrint = "Choose where to move the mouse from the list below\n"
        for i, item in enumerate(locations):
            print(item)
            toPrint = toPrint + str(i) + ". " + str(item[0]) + "\n"
        print(toPrint)
        userChoice = input("Which number do you choose? ")
        try:
            userChoice = int(userChoice)
        except ValueError:
            print("You did not input a number.")
            return  # bail out instead of indexing with a non-integer
        pyautogui.moveTo(locations[userChoice][1][0],
                         locations[userChoice][1][1])


The "chooseManually(self, locations)" method takes a list of these "locations," to be called coordinates in the future, which contains sublists of text data and their corresponding central locations in the input image. By prompting the user to select a text element from the list in the terminal, the program automatically uses "pyautogui.moveTo()" to move the mouse cursor to the central location of that element. This method provides a limited interactive way for users to select elements.


def main():
    Identifier()


if __name__ == "__main__":
    main()


The "if __name__ == '__main__':" conditional statement ensures that the "main()" function is called only when the script is executed directly from the commandline.


Under the system module:

    def image(self, show=False, portion=None):
        """Takes a screenshot and returns it as a Pillow image."""
        if not show:
            imagePath = self.series("screenshots", new=True)
            # To use pyautogui on Linux, run: sudo apt-get install scrot
            myScreenshot = pyautogui.screenshot()
            print(imagePath)
            myScreenshot.save(imagePath)
            return [imagePath, Image.open(imagePath)]
        else:
            pass  # displaying the screenshot is not handled yet


The "image(self, show=False, portion=None)" method captures a screenshot using the "pyautogui.screenshot()" function and saves the captured image to a file using the "series" function, which creates a file with a unique number-based series filename within a specified directory. The method returns a list containing the image path and the screenshot as a Pillow Image object. The series function will be made available in future weeks.


Future Tasks