
Proposal for near-real-time image element interaction


Context

I’ve been working on some projects recently that highlight the power of the new image matching functionality in Appium. I think we can take this to the next level through a series of improvements. I wanted to lay out the proposal here before beginning work since it would be several different pieces and a fair amount of code.

Basically, the goal is very quick recognition and interaction with screen positions based on image matching. The pieces for this are already present in Appium, but they are too slow. The process right now looks like:

  1. Take a screenshot (costs a client-server roundtrip, and whatever time Appium’s screenshot dumping methods take)
  2. Send the screenshot together with a reference image template back to the Appium server, to get match details like x/y coordinates (costs a client-server roundtrip, and whatever time Appium’s opencv matching methods take)
  3. Tap the screen based on the coordinates received (costs a client-server roundtrip, and whatever time Appium’s tap methods take)

All of these steps take so much time that matching and interacting with image areas on screen will not be reliable unless the screen is completely static. I think we can open up greater reliability and the possibility of completely new applications (for example, game testing) with a faster process. Looking at the current process, there are a number of potential areas for speedup:

  1. 3 client-server roundtrips over HTTP (potentially in high-latency contexts like cloud testing)
  2. Screenshot time
  3. OpenCV matching time
  4. Tap screen time

I’m going to ignore (3) and (4) above since I don’t see any easy way to improve them. (2) is already being addressed by https://github.com/appium/appium-support/pull/75 and related PRs, for those who have access to an mjpeg screenshot stream (currently thinking about how to make this technology available via Appium for all users as well).

That leaves the 3 client-server roundtrips as the primary bottleneck, which is a big bottleneck indeed when running in a cloud environment. I think we can eliminate all of them.
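To make the tradeoff concrete, here is a back-of-the-envelope sketch. All of the constants are made-up but plausible assumptions, not measurements:

```javascript
// Illustrative latency figures (assumptions, not measurements):
// per-roundtrip network cost in a cloud environment vs. locally.
const CLOUD_RTT_MS = 150;
const LOCAL_RTT_MS = 5;

// Fixed server-side costs that this proposal leaves alone.
const SCREENSHOT_MS = 300;
const MATCH_MS = 100;
const TAP_MS = 20;

// Total time to find a matched image and tap it, for a given
// per-roundtrip latency and number of client-server roundtrips.
function totalFindAndTapTime(rttMs, roundtrips) {
  return roundtrips * rttMs + SCREENSHOT_MS + MATCH_MS + TAP_MS;
}

// Current flow: 3 roundtrips (screenshot, match, tap).
const cloudNow = totalFindAndTapTime(CLOUD_RTT_MS, 3); // 870 ms
// Proposed flow: 1 combined roundtrip.
const cloudProposed = totalFindAndTapTime(CLOUD_RTT_MS, 1); // 570 ms
```

Under these assumed numbers, the roundtrips account for more than half of the total in a cloud environment, which is why eliminating them is the focus of this proposal.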

Proposal

1. Find Element By Image

I have already implemented (in the Python and WD.js clients) an element-like interface for image template matches. In my opinion, this is the most natural and useful way of using Appium’s image matching feature. From the user’s perspective, they supply an image template to a “findElementByImage” command, and they get back an element object in their script, just as if they had found an element by xpath or anything else. The only difference is that the element object is of a different class (ImageElement, say), and has only a subset of WebElement’s methods: click, getSize, getLocation, getRect, and isVisible.

Right now this logic is being duplicated in each client. Instead, I propose to move this logic to the Appium server, specifically into BaseDriver. From the client perspective, we would simply have a new locator strategy (-image), and the selector passed in would be the base64-encoded template image (with one wrinkle; see below). The response from the server would be the same JSON response as for a WebElement, and on the client side it can be wrapped up into a WebElement object.

To support this, the server side would be implemented as follows:

  1. New code added to BaseDriver to trap calls to find and check if the strategy is -image
  2. If so, screenshot retrieved and image matching methods called immediately
  3. In case of a match, a unique image element ID is created and stored in a hash together with the match position location
  4. In case of no match, NoSuchElementException returned
  5. When the client calls an element method using an image element ID, the server would check for this situation in execute before passing control onto the command methods. If we’re operating on an image element, we check that the command is in the list of available image element commands.
  6. If so, we perform the action (tap etc…) in BaseDriver itself
  7. If not, we return NotImplementedException

This procedure solves two main problems:

  1. Eliminates one of the server-client roundtrips (now no need to get screenshot and do the image match separately—the server does them both in the same request)
  2. Reduces the amount of work required to support this feature (it’s in the server in one place, rather than requiring many client implementations).

The wrinkle in this proposal so far is that matching image templates doesn’t just require the image to be matched: it also takes a threshold parameter, below which a match would be rejected. We could simply set this at a default value and hope it’s good enough for all users. Better would be to have a default value but allow users to set the match threshold in their client code, for example:

driver.elementByImage(b64Data, 0.5);

The problem here is that the find element commands take a strategy and a selector, but we want a third parameter. Rather than adjust the API, I would propose to encode the threshold in the selector itself, something like this:

// before the client sends the selector to the server
let selector = `threshold=0.5|${b64Data}`;

In other words, add a bit of preamble to the base 64 string which is easy to parse and strip out.
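A minimal sketch of that encoding and the corresponding server-side parsing might look like this (the default threshold value here is an arbitrary placeholder, not the real default):

```javascript
// Placeholder default; the actual default value would be up for debate.
const DEFAULT_THRESHOLD = 0.4;

// Client side: prepend an optional "threshold=<n>|" preamble to the
// base64 template data.
function encodeImageSelector(b64Data, threshold) {
  return threshold === undefined ? b64Data : `threshold=${threshold}|${b64Data}`;
}

// Server side: strip the preamble if present, fall back to the default.
function parseImageSelector(selector) {
  const match = /^threshold=([\d.]+)\|/.exec(selector);
  if (!match) {
    return {threshold: DEFAULT_THRESHOLD, b64Data: selector};
  }
  return {
    threshold: parseFloat(match[1]),
    b64Data: selector.slice(match[0].length),
  };
}
```

Because valid base64 never contains `|` or `=` in that leading position, the preamble is unambiguous to detect and strip.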

2. Stale Image Elements

One problem with image matching and tapping via coordinates in general is that there’s no guarantee the coordinates of a matched image on the screen still represent the correct underlying element: who’s to say it didn’t change in between the time we get an element and act on it?

The solution to this is to re-match the template and assert that the coordinates are the same, before attempting to do an action. I.e., if a user calls imageElement.click() in their client code, then the server would re-match the image template and assert that its coordinates match the previously-retrieved coordinates for that image element, before performing the tap. If there’s no match or the coordinates differ, a StaleElementException is returned instead.

How does the server know how to re-match the image template? Well, the base64 image template data string would have to be stored as part of the image element hash. Since image data can get large, we would probably want to have some sort of ejection policy for the hash, maybe making it an LRU cache bounded by total memory usage or something.
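A rough sketch of both ideas, the pre-action stale check and a bounded cache for the stored templates, might look like this (this toy cache is bounded by entry count rather than total memory usage, and all names are hypothetical):

```javascript
// A simple size-bounded LRU cache for image elements. A real version
// would likely bound total memory instead of entry count (assumption).
class ImageElementCache {
  constructor(maxEntries = 100) {
    this.maxEntries = maxEntries;
    this.map = new Map(); // id -> {rect, template}; Map keeps insertion order
  }
  set(id, entry) {
    if (this.map.has(id)) this.map.delete(id);
    this.map.set(id, entry);
    if (this.map.size > this.maxEntries) {
      // evict the least-recently-used (oldest) entry
      this.map.delete(this.map.keys().next().value);
    }
  }
  get(id) {
    const entry = this.map.get(id);
    if (entry) {
      // refresh recency on access
      this.map.delete(id);
      this.map.set(id, entry);
    }
    return entry;
  }
}

function rectsEqual(a, b) {
  return a.x === b.x && a.y === b.y && a.width === b.width && a.height === b.height;
}

// Before performing an action, re-match the stored template (via a
// caller-supplied rematch function) and verify the position is unchanged.
async function assertNotStale(cache, id, rematch) {
  const entry = cache.get(id);
  const fresh = entry && await rematch(entry.template);
  if (!fresh || !rectsEqual(fresh.rect, entry.rect)) {
    throw new Error('StaleElementError: image no longer matches at the stored position');
  }
}
```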

This also points out a nagging issue, not just with finding and acting on image elements, but finding and acting on any element: the chance that, between the time an element is found and the time it is interacted with, the situation in the app might have changed. This is what StaleElementExceptions are designed to notify us about. But what if we could greatly reduce the chance of such an exception, and speed up our tests in the process? We can!

3. findAndAct

I propose we bring back a feature we had in the bad old days of Apple’s UIAutomation driver, when we very frequently ran into this problem of elements being invalidated and made unavailable for actions in just the time it took for a client-server roundtrip. The primary motivation here is to help with extremely low-latency test applications (dynamic image matching), but the benefit extends to interacting with elements as a whole.

Basically, we add a new endpoint which encapsulates both finding and acting. It would take a strategy, selector, command name, and any command parameters. As a first pass, it could look like:

# endpoint
POST /session/:sessionid/appium/find_and_act

# request JSON
{
  "using": "xpath",
  "value": "//foo/bar",
  "command": "sendKeys",
  "params": {
    "text": "hello world"
  }
}

# response is the response for the action--no element is returned
# however, if an element is not found, NoSuchElementException
# would be returned

Since we want to act immediately on a found element, this only makes sense in the singular (no findElements support is necessary).
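A sketch of what the server-side handler for this endpoint might look like (the command whitelist and the driver methods used here are assumptions about how BaseDriver could route the combined request):

```javascript
// Hypothetical whitelist of actions find_and_act would support.
const FIND_AND_ACT_COMMANDS = new Set(['click', 'sendKeys', 'clear']);

// Handle a find_and_act request body: find the element and act on it in
// the same server-side request, returning only the action's response.
async function findAndAct(driver, {using, value, command, params = {}}) {
  if (!FIND_AND_ACT_COMMANDS.has(command)) {
    throw new Error(`NotImplementedError: '${command}' is not supported by find_and_act`);
  }
  // Singular find only; throws NoSuchElementError if nothing is found.
  const element = await driver.findElement(using, value);
  // No element id is ever returned to the client, so there is no window
  // for the element to go stale between find and act.
  return driver.executeCommand(command, element, params);
}
```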

Putting all of these changes together, we have the potential for an extremely fast and reliable way of interacting with apps based on image matching, with side benefits for speeding up tests in other regards as well. We can reduce finding and tapping on a matched image to one client-server roundtrip, with the least possible chance of any slippage of app state (between finding a matched image coordinate set and tapping on it there is only the time that the tap itself takes, which should be in low ms).

I’d be happy to implement this proposal, but because it’s pretty large and cross-cutting I wanted to get conceptual buy-in from the other maintainers first and solicit other ideas or any issues you see with it.

Issue Analytics

  • State: closed
  • Created: 5 years ago
  • Reactions: 11
  • Comments: 18 (13 by maintainers)

Top GitHub Comments

3 reactions
jlipps commented, Jul 13, 2018

By the way, if you’re all cool with P1 (image element finding on the server side), I can start implementing that while we debate the whole batched commands proposal.

1 reaction
jlipps commented, Jun 2, 2019

I spent some time working on the batch commands problem this weekend, and might have something to show soon. It’s kind of a different idea, will see what you think of the PR 😃
