Proposal for near-real-time image element interaction
Context
I’ve been working on some projects recently that highlight the power of the new image matching functionality in Appium. I think we can take this to the next level through a series of improvements. I wanted to lay out the proposal here before beginning work, since it would involve several different pieces and a fair amount of code.
Basically, the goal is very quick recognition of, and interaction with, screen positions based on image matching. The pieces for this are already present in Appium, but they are too slow. The process right now looks like this (sketched in code after the list):
- Take a screenshot (costs a client-server roundtrip, and whatever time Appium’s screenshot dumping methods take)
- Send the screenshot together with a reference image template back to the Appium server, to get match details like x/y coordinates (costs a client-server roundtrip, and whatever time Appium’s opencv matching methods take)
- Tap the screen based on the coordinates received (costs a client-server roundtrip, and whatever time Appium’s tap methods take)
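For concreteness, here’s roughly what that flow looks like today in a wd-style JavaScript client (the `matchImages` and `tap` helpers here are illustrative stand-ins, not exact client APIs):

```js
// Illustrative sketch of the current three-roundtrip flow.
// 1. Screenshot: one client-server roundtrip
const b64Screenshot = await driver.takeScreenshot();

// 2. Matching: send the screenshot plus the reference template back to
//    the server to get match coordinates (second roundtrip)
const {x, y} = await driver.matchImages(b64Screenshot, b64Template);

// 3. Tap: act on the coordinates (third roundtrip). By now the screen
//    may already have changed underneath us.
await driver.tap(x, y);
```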
All of these steps take so much time that matching and interacting with image areas on screen will not be reliable unless the screen is completely static. I think we can open up greater reliability and the possibility of completely new applications (for example, game testing) with a faster process. Looking at the current process, there are a number of potential areas for speedup:
1. 3 client-server roundtrips over HTTP (potentially in high-latency contexts like cloud testing)
2. Screenshot time
3. OpenCV matching time
4. Tap screen time
I’m going to ignore (3) and (4) above since I don’t see any easy way to improve them. (2) is already being addressed by https://github.com/appium/appium-support/pull/75 and related PRs, for those who have access to an mjpeg screenshot stream (currently thinking about how to make this technology available via Appium for all users as well).
That leaves the 3 client-server roundtrips as the primary bottleneck, which is a big bottleneck indeed when running in a cloud environment. I think we can eliminate all of them.
Proposal
1. Find Element By Image
I have already implemented (in the Python and WD.js clients) an element-like interface for image template matches. In my opinion, this is the most natural and useful way of using Appium’s image matching feature. From the perspective of the user, they supply an image template to a “findElementByImage” command, and they get back an element object in their script, just as if they had tried to find an element by xpath or anything else. The only difference is that the element object is of a different class (`ImageElement`, say), and has only a subset of `WebElement`’s methods: `click`, `getSize`, `getLocation`, `getRect`, and `isVisible`.
Right now this logic is being duplicated in each client. Instead, I propose to move this logic to the Appium server, specifically into BaseDriver. Basically, from the client perspective, we would simply have a new locator strategy (`-image`), and the selector passed in would be the base64-encoded template image (with one wrinkle; see below). The response from the server would be the same JSON response as for a `WebElement`, and from the client perspective, it can be wrapped up into a `WebElement` object.
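From the client’s point of view, usage could look something like the following sketch (`elementByImage` is a hypothetical wrapper; the wire format is just the standard find-element request with `using: '-image'`):

```js
const fs = require('fs');

// Hypothetical client-side usage of the proposed -image locator strategy.
const b64Template = fs.readFileSync('submit-button.png').toString('base64');

// Under the hood this POSTs the standard find-element payload
// {using: '-image', value: b64Template}; the server does the screenshot
// and the match itself, and returns an ordinary element ID.
const el = await driver.elementByImage(b64Template);

// Only the supported subset of WebElement methods would work:
await el.click();
const {x, y} = await el.getLocation();
```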
To support this, the server side would be implemented as follows (a rough code sketch follows the list):
- New code added to BaseDriver to trap calls to `find` and check whether the strategy is `-image`
- If so, a screenshot is retrieved and the image matching methods are called immediately
- In case of a match, a unique image element ID is created and stored in a hash together with the matched position
- In case of no match, `NoSuchElementException` is returned
- When the client calls an element method using an image element ID, the server would check for this situation in `execute` before passing control on to the command methods. If we’re operating on an image element, we check that the command is in the list of available image element commands.
  - If so, we perform the action (tap, etc.) in BaseDriver itself
  - If not, we return `NotImplementedException`
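Here is a very rough sketch of how this could hang together in BaseDriver. Every name below (`matchTemplate`, `getScreenshot`, the command list, and so on) is invented for illustration, not existing appium-base-driver API:

```js
// Sketch only: hypothetical BaseDriver additions for the -image strategy.
const IMAGE_STRATEGY = '-image';
const IMAGE_ELEMENT_PREFIX = 'appium-image-element-';
const IMAGE_ELEMENT_COMMANDS = new Set(['click', 'getSize', 'getLocation', 'getRect']);

class ImageElementFinder {
  constructor (driver) {
    this.driver = driver;           // the real driver: screenshots, taps, etc.
    this.imageElements = new Map(); // image element ID -> {rect, template}
    this.nextId = 1;
  }

  async find (strategy, selector) {
    if (strategy !== IMAGE_STRATEGY) {
      return await this.driver.find(strategy, selector);
    }
    // Screenshot + match happen in one server-side step, not two roundtrips
    const screenshot = await this.driver.getScreenshot();
    const match = await this.driver.matchTemplate(screenshot, selector);
    if (!match) {
      throw new Error('NoSuchElement'); // would be a proper NoSuchElementException
    }
    const id = `${IMAGE_ELEMENT_PREFIX}${this.nextId++}`;
    this.imageElements.set(id, {rect: match.rect, template: selector});
    return {ELEMENT: id};
  }

  async execute (command, elementId, ...args) {
    if (!elementId.startsWith(IMAGE_ELEMENT_PREFIX)) {
      return await this.driver.execute(command, elementId, ...args);
    }
    if (!IMAGE_ELEMENT_COMMANDS.has(command)) {
      throw new Error('NotImplemented'); // would be NotImplementedException
    }
    // dispatch to an image-element-specific implementation (tap the center
    // of the stored rect for 'click', return the rect for 'getRect', etc.)
  }
}
```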
This procedure solves two main problems:
- Eliminates one of the server-client roundtrips (now no need to get screenshot and do the image match separately—the server does them both in the same request)
- Reduces the amount of work required to support this feature (it’s in the server in one place, rather than requiring many client implementations).
The wrinkle in this proposal so far is that matching image templates doesn’t just require the template image itself; it also takes a threshold parameter, below which a match would be rejected. We could simply set this at a default value and hope it’s good enough for all users. Better would be to have a default value but allow users to set the match threshold in their client code, for example:
```js
driver.elementByImage(b64Data, 0.5);
```
The problem here is that the find element commands take a strategy and a selector, but we want a third parameter. Rather than adjust the API, I would propose to encode the threshold in the selector itself, something like this:
```js
// before the client sends the selector to the server
let selector = `threshold=0.5|${b64Data}`;
```
In other words, add a bit of preamble to the base64 string which is easy to parse and strip out.
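Parsing that preamble back out on the server side would be trivial. Something like the sketch below (the function name and the default value are placeholders, not decided API):

```js
// Sketch: strip an optional 'threshold=<float>|' preamble from the selector.
const DEFAULT_THRESHOLD = 0.4; // placeholder default; the real value is TBD

function parseImageSelector (selector) {
  const match = /^threshold=([\d.]+)\|/.exec(selector);
  if (!match) {
    return {threshold: DEFAULT_THRESHOLD, b64Template: selector};
  }
  return {
    threshold: parseFloat(match[1]),
    b64Template: selector.slice(match[0].length),
  };
}

// parseImageSelector('threshold=0.5|iVBORw0KGgo...')
//   => {threshold: 0.5, b64Template: 'iVBORw0KGgo...'}
```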
2. Stale Image Elements
One problem with image matching and tapping by coordinates in general is that there’s no guarantee the coordinates of a matched image on the screen still represent the correct underlying element: who’s to say it didn’t change in between the time we get an element and the time we act on it?
The solution to this is to re-match the template and assert that the coordinates are the same, before attempting to do an action. I.e., if a user calls `imageElement.click()` in their client code, then the server would re-match the image template and assert that its coordinates match the previously-retrieved coordinates for that image element, before performing the tap. If there’s no match or the coordinates differ, a `StaleElementException` is returned instead.
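In sketch form, the server-side click handler could look like this (again, `getScreenshot` and `matchTemplate` are assumed helper names, and the exact-equality rect check is a simplification):

```js
// Sketch: re-verify an image element's position before acting on it.
async function clickImageElement (driver, imageElements, elementId) {
  const stored = imageElements.get(elementId);
  const screenshot = await driver.getScreenshot();
  const match = await driver.matchTemplate(screenshot, stored.template);

  // Stale if the template no longer matches, or matches somewhere else
  if (!match || !rectsEqual(match.rect, stored.rect)) {
    throw new Error('StaleElementReference');
  }
  // Tap the center of the verified rect
  const x = stored.rect.x + stored.rect.width / 2;
  const y = stored.rect.y + stored.rect.height / 2;
  await driver.tap(x, y);
}

function rectsEqual (a, b) {
  return a.x === b.x && a.y === b.y && a.width === b.width && a.height === b.height;
}
```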
How does the server know how to re-match the image template? Well, the base64 image template data string would have to be stored as part of the image element hash. Since image data can get large, we would probably want to have some sort of eviction policy for the hash, maybe making it an LRU cache bounded by total memory usage or something.
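As a sketch, that eviction policy could be as simple as an LRU bounded by the total bytes of stored template data (the `maxBytes` knob and class name here are invented):

```js
// Sketch: LRU cache for image elements, bounded by total template size.
class ImageElementCache {
  constructor (maxBytes = 50 * 1024 * 1024) {
    this.maxBytes = maxBytes;
    this.bytes = 0;
    this.entries = new Map(); // Map preserves insertion order: oldest first
  }

  set (id, element) {
    this.entries.set(id, element);
    this.bytes += element.template.length; // base64 length ~ bytes
    // Evict least-recently-used entries until we're back under budget
    while (this.bytes > this.maxBytes && this.entries.size > 1) {
      const [oldestId, oldest] = this.entries.entries().next().value;
      this.entries.delete(oldestId);
      this.bytes -= oldest.template.length;
    }
  }

  get (id) {
    const element = this.entries.get(id);
    if (element) {
      // Refresh recency by re-inserting at the end of the Map
      this.entries.delete(id);
      this.entries.set(id, element);
    }
    return element;
  }
}
```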
This also points out a nagging issue, not just with finding and acting on image elements, but with finding and acting on any element: the chance that, between the time an element is found and the time it is interacted with, the situation in the app might have changed. This is what `StaleElementException`s are designed to notify us about. But what if we could greatly reduce the chance of such an exception, and speed up our tests in the process? We can!
3. findAndAct
I propose we bring back a feature we had in the bad old days of Apple’s UIAutomation driver, when we very frequently ran into this problem of elements being invalidated and unavailable for actions in just the time it took for a client-server roundtrip. The primary motivation here is to help with extremely low-latency test applications (dynamic image matching), but the benefit extends to interacting with elements as a whole.
Basically, we add a new endpoint which encapsulates both finding and acting. It would take a strategy, selector, command name, and any command parameters. As a first pass, it could look like:
```
# endpoint
POST /session/:sessionid/appium/find_and_act

# request JSON
{
  "using": "xpath",
  "value": "//foo/bar",
  "command": "sendKeys",
  "params": {
    "text": "hello world"
  }
}

# the response is the response for the action; no element is returned.
# However, if an element is not found, a NoSuchElementException
# would be returned
```
Since we want to act immediately on a found element, this only makes sense in the singular (no findElements support is necessary).
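On the client side, a convenience wrapper might expose this as a single call. The helper below is purely hypothetical (and assumes a fetch-capable environment and the usual local server URL):

```js
// Hypothetical one-roundtrip client helper for the proposed endpoint.
async function findAndAct (sessionId, using, value, command, params = {}) {
  const url = `http://localhost:4723/wd/hub/session/${sessionId}/appium/find_and_act`;
  const res = await fetch(url, {
    method: 'POST',
    headers: {'Content-Type': 'application/json'},
    body: JSON.stringify({using, value, command, params}),
  });
  if (!res.ok) {
    throw new Error(`find_and_act failed: ${res.status}`); // e.g. no such element
  }
  return (await res.json()).value;
}

// e.g. find a text field by xpath and type into it, all in one roundtrip:
// await findAndAct(sessionId, 'xpath', '//foo/bar', 'sendKeys', {text: 'hello world'});
```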
Putting all of these changes together, we have the potential for an extremely fast and reliable way of interacting with apps based on image matching, with side benefits for speeding up tests in other regards as well. We can reduce finding and tapping on a matched image to one client-server roundtrip, with the least possible chance of any slippage of app state (between finding a matched image coordinate set and tapping on it there is only the time that the tap itself takes, which should be in low ms).
I’d be happy to implement this proposal, but because it’s pretty large and cross-cutting I wanted to get conceptual buy-in from the other maintainers first and solicit other ideas or any issues you see with it.
Top GitHub Comments
By the way, if you’re all cool with P1 (image element finding on the server side), I can start implementing that while we debate the whole batched commands proposal.
I spent some time working on the batch commands problem this weekend, and might have something to show soon. It’s kind of a different idea, will see what you think of the PR 😃