Proposal for near-real-time image element interaction
Context
I’ve been working on some projects recently that highlight the power of the new image matching functionality in Appium. I think we can take this to the next level through a series of improvements. I wanted to lay out the proposal here before beginning work, since it would involve several different pieces and a fair amount of code.
Basically, the goal is very quick recognition of, and interaction with, screen positions based on image matching. The pieces for this are already present in Appium, but they are too slow. The process right now looks like this (sketched in code after the list):
- Take a screenshot (costs a client-server roundtrip, and whatever time Appium’s screenshot dumping methods take)
- Send the screenshot together with a reference image template back to the Appium server, to get match details like x/y coordinates (costs a client-server roundtrip, and whatever time Appium’s opencv matching methods take)
- Tap the screen based on the coordinates received (costs a client-server roundtrip, and whatever time Appium’s tap methods take)
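For concreteness, here’s roughly what that flow looks like today in a wd-style JavaScript client (the `matchImages` and `tap` helpers here are illustrative stand-ins, not exact client APIs):

```js
// Illustrative sketch of the current three-roundtrip flow.
// 1. Screenshot: one client-server roundtrip
const b64Screenshot = await driver.takeScreenshot();

// 2. Matching: send the screenshot plus the reference template back to
//    the server to get match coordinates (second roundtrip)
const {x, y} = await driver.matchImages(b64Screenshot, b64Template);

// 3. Tap: act on the coordinates (third roundtrip). By now the screen
//    may already have changed underneath us.
await driver.tap(x, y);
```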
All of these steps take so much time that matching and interacting with image areas on screen will not be reliable unless the screen is completely static. I think we can open up greater reliability and the possibility of completely new applications (for example, game testing) with a faster process. Looking at the current process, there are a number of potential areas for speedup:
1. 3 client-server roundtrips over HTTP (potentially in high-latency contexts like cloud testing)
2. Screenshot time
3. OpenCV matching time
4. Tap screen time
I’m going to ignore (3) and (4) above since I don’t see any easy way to improve them. (2) is already being addressed by https://github.com/appium/appium-support/pull/75 and related PRs, for those who have access to an mjpeg screenshot stream (currently thinking about how to make this technology available via Appium for all users as well).
That leaves the 3 client-server roundtrips as the primary bottleneck, which is a big bottleneck indeed when running in a cloud environment. I think we can eliminate all of them.
Proposal
1. Find Element By Image
I have already implemented (in the Python and WD.js clients) an element-like interface for image template matches. In my opinion, this is the most natural and useful way of using Appium’s image matching feature. From the perspective of the user, they supply an image template to a “findElementByImage” command, and they get back an element object in their script, just as if they had tried to find an element by xpath or anything else. The only difference is that the element object is of a different class (`ImageElement`, say), and has only a subset of `WebElement`’s methods: `click`, `getSize`, `getLocation`, `getRect`, and `isVisible`.
Right now this logic is being duplicated in each client. Instead, I propose to move this logic to the Appium server, specifically into BaseDriver. Basically, from the client perspective, we would simply have a new locator strategy (`-image`), and the selector passed in would be the base64-encoded template image (with one wrinkle; see below). The response from the server would be the same JSON response as for a `WebElement`, and from the client perspective, it can be wrapped up into a `WebElement` object.
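From the client’s point of view, usage could look something like the following sketch (`elementByImage` is a hypothetical wrapper; the wire format is just the standard find-element request with `using: '-image'`):

```js
const fs = require('fs');

// Hypothetical client-side usage of the proposed -image locator strategy.
const b64Template = fs.readFileSync('submit-button.png').toString('base64');

// Under the hood this POSTs the standard find-element payload
// {using: '-image', value: b64Template}; the server does the screenshot
// and the match itself, and returns an ordinary element ID.
const el = await driver.elementByImage(b64Template);

// Only the supported subset of WebElement methods would work:
await el.click();
const {x, y} = await el.getLocation();
```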
To support this, the server side would be implemented as follows (a rough code sketch follows the list):
- New code added to BaseDriver to trap calls to `find` and check whether the strategy is `-image`
- If so, a screenshot is retrieved and the image matching methods are called immediately
- In case of a match, a unique image element ID is created and stored in a hash together with the matched position
- In case of no match, `NoSuchElementException` is returned
- When the client calls an element method using an image element ID, the server would check for this situation in `execute` before passing control on to the command methods. If we’re operating on an image element, we check that the command is in the list of available image element commands.
  - If so, we perform the action (tap, etc.) in BaseDriver itself
  - If not, we return `NotImplementedException`
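Here is a very rough sketch of how this could hang together in BaseDriver. Every name below (`matchTemplate`, `getScreenshot`, the command list, and so on) is invented for illustration, not existing appium-base-driver API:

```js
// Sketch only: hypothetical BaseDriver additions for the -image strategy.
const IMAGE_STRATEGY = '-image';
const IMAGE_ELEMENT_PREFIX = 'appium-image-element-';
const IMAGE_ELEMENT_COMMANDS = new Set(['click', 'getSize', 'getLocation', 'getRect']);

class ImageElementFinder {
  constructor (driver) {
    this.driver = driver;           // the real driver: screenshots, taps, etc.
    this.imageElements = new Map(); // image element ID -> {rect, template}
    this.nextId = 1;
  }

  async find (strategy, selector) {
    if (strategy !== IMAGE_STRATEGY) {
      return await this.driver.find(strategy, selector);
    }
    // Screenshot + match happen in one server-side step, not two roundtrips
    const screenshot = await this.driver.getScreenshot();
    const match = await this.driver.matchTemplate(screenshot, selector);
    if (!match) {
      throw new Error('NoSuchElement'); // would be a proper NoSuchElementException
    }
    const id = `${IMAGE_ELEMENT_PREFIX}${this.nextId++}`;
    this.imageElements.set(id, {rect: match.rect, template: selector});
    return {ELEMENT: id};
  }

  async execute (command, elementId, ...args) {
    if (!elementId.startsWith(IMAGE_ELEMENT_PREFIX)) {
      return await this.driver.execute(command, elementId, ...args);
    }
    if (!IMAGE_ELEMENT_COMMANDS.has(command)) {
      throw new Error('NotImplemented'); // would be NotImplementedException
    }
    // dispatch to an image-element-specific implementation (tap the center
    // of the stored rect for 'click', return the rect for 'getRect', etc.)
  }
}
```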
This procedure solves two main problems:
- Eliminates one of the server-client roundtrips (now no need to get screenshot and do the image match separately—the server does them both in the same request)
- Reduces the amount of work required to support this feature (it’s in the server in one place, rather than requiring many client implementations).
The wrinkle in this proposal so far is that matching image templates doesn’t just require the template image itself; it also takes a threshold parameter, below which a match would be rejected. We could simply set this at a default value and hope it’s good enough for all users. Better would be to have a default value but allow users to set the match threshold in their client code, for example:
```js
driver.elementByImage(b64Data, 0.5);
```
The problem here is that the find element commands take a strategy and a selector, but we want a third parameter. Rather than adjust the API, I would propose to encode the threshold in the selector itself, something like this:
```js
// before the client sends the selector to the server
let selector = `threshold=0.5|${b64Data}`;
```
In other words, add a bit of preamble to the base64 string which is easy to parse and strip out.
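Parsing that preamble back out on the server side would be trivial. Something like the sketch below (the function name and the default value are placeholders, not decided API):

```js
// Sketch: strip an optional 'threshold=<float>|' preamble from the selector.
const DEFAULT_THRESHOLD = 0.4; // placeholder default; the real value is TBD

function parseImageSelector (selector) {
  const match = /^threshold=([\d.]+)\|/.exec(selector);
  if (!match) {
    return {threshold: DEFAULT_THRESHOLD, b64Template: selector};
  }
  return {
    threshold: parseFloat(match[1]),
    b64Template: selector.slice(match[0].length),
  };
}

// parseImageSelector('threshold=0.5|iVBORw0KGgo...')
//   => {threshold: 0.5, b64Template: 'iVBORw0KGgo...'}
```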
2. Stale Image Elements
One problem with image matching and tapping by coordinates in general is that there’s no guarantee the coordinates of a matched image on the screen still represent the correct underlying element: who’s to say it didn’t change in between the time we get an element and the time we act on it?
The solution to this is to re-match the template and assert that the coordinates are the same, before attempting to do an action. I.e., if a user calls `imageElement.click()` in their client code, then the server would re-match the image template and assert that its coordinates match the previously-retrieved coordinates for that image element, before performing the tap. If there’s no match or the coordinates differ, a `StaleElementException` is returned instead.
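In sketch form, the server-side click handler could look like this (again, `getScreenshot` and `matchTemplate` are assumed helper names, and the exact-equality rect check is a simplification):

```js
// Sketch: re-verify an image element's position before acting on it.
async function clickImageElement (driver, imageElements, elementId) {
  const stored = imageElements.get(elementId);
  const screenshot = await driver.getScreenshot();
  const match = await driver.matchTemplate(screenshot, stored.template);

  // Stale if the template no longer matches, or matches somewhere else
  if (!match || !rectsEqual(match.rect, stored.rect)) {
    throw new Error('StaleElementReference');
  }
  // Tap the center of the verified rect
  const x = stored.rect.x + stored.rect.width / 2;
  const y = stored.rect.y + stored.rect.height / 2;
  await driver.tap(x, y);
}

function rectsEqual (a, b) {
  return a.x === b.x && a.y === b.y && a.width === b.width && a.height === b.height;
}
```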
How does the server know how to re-match the image template? Well, the base64 image template data string would have to be stored as part of the image element hash. Since image data can get large, we would probably want to have some sort of eviction policy for the hash, maybe making it an LRU cache bounded by total memory usage or something.
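As a sketch, that eviction policy could be as simple as an LRU bounded by the total bytes of stored template data (the `maxBytes` knob and class name here are invented):

```js
// Sketch: LRU cache for image elements, bounded by total template size.
class ImageElementCache {
  constructor (maxBytes = 50 * 1024 * 1024) {
    this.maxBytes = maxBytes;
    this.bytes = 0;
    this.entries = new Map(); // Map preserves insertion order: oldest first
  }

  set (id, element) {
    this.entries.set(id, element);
    this.bytes += element.template.length; // base64 length ~ bytes
    // Evict least-recently-used entries until we're back under budget
    while (this.bytes > this.maxBytes && this.entries.size > 1) {
      const [oldestId, oldest] = this.entries.entries().next().value;
      this.entries.delete(oldestId);
      this.bytes -= oldest.template.length;
    }
  }

  get (id) {
    const element = this.entries.get(id);
    if (element) {
      // Refresh recency by re-inserting at the end of the Map
      this.entries.delete(id);
      this.entries.set(id, element);
    }
    return element;
  }
}
```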
This also points out a nagging issue, not just with finding and acting on image elements, but with finding and acting on any element: the chance that, between the time an element is found and the time it is interacted with, the situation in the app might have changed. This is what `StaleElementException`s are designed to notify us about. But what if we could greatly reduce the chance of such an exception, and speed up our tests in the process? We can!
3. findAndAct
I propose we bring back a feature we had in the bad old days of Apple’s UIAutomation driver, when we very frequently ran into this problem of elements being invalidated and unavailable for actions in just the time it took for a client-server roundtrip. The primary motivation here is to help with extremely low-latency test applications (dynamic image matching), but the benefit extends to interacting with elements as a whole.
Basically, we add a new endpoint which encapsulates both finding and acting. It would take a strategy, selector, command name, and any command parameters. As a first pass, it could look like:
```
# endpoint
POST /session/:sessionid/appium/find_and_act

# request JSON
{
  "using": "xpath",
  "value": "//foo/bar",
  "command": "sendKeys",
  "params": {
    "text": "hello world"
  }
}

# the response is the response for the action; no element is returned.
# However, if an element is not found, a NoSuchElementException
# would be returned
```
Since we want to act immediately on a found element, this only makes sense in the singular (no findElements support is necessary).
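On the client side, a convenience wrapper might expose this as a single call. The helper below is purely hypothetical (and assumes a fetch-capable environment and the usual local server URL):

```js
// Hypothetical one-roundtrip client helper for the proposed endpoint.
async function findAndAct (sessionId, using, value, command, params = {}) {
  const url = `http://localhost:4723/wd/hub/session/${sessionId}/appium/find_and_act`;
  const res = await fetch(url, {
    method: 'POST',
    headers: {'Content-Type': 'application/json'},
    body: JSON.stringify({using, value, command, params}),
  });
  if (!res.ok) {
    throw new Error(`find_and_act failed: ${res.status}`); // e.g. no such element
  }
  return (await res.json()).value;
}

// e.g. find a text field by xpath and type into it, all in one roundtrip:
// await findAndAct(sessionId, 'xpath', '//foo/bar', 'sendKeys', {text: 'hello world'});
```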
Putting all of these changes together, we have the potential for an extremely fast and reliable way of interacting with apps based on image matching, with side benefits for speeding up tests in other regards as well. We can reduce finding and tapping on a matched image to one client-server roundtrip, with the least possible chance of any slippage of app state (between finding a matched image coordinate set and tapping on it there is only the time that the tap itself takes, which should be in low ms).
I’d be happy to implement this proposal, but because it’s pretty large and cross-cutting I wanted to get conceptual buy-in from the other maintainers first and solicit other ideas or any issues you see with it.
Top GitHub Comments
By the way, if you’re all cool with P1 (image element finding on the server side), I can start implementing that while we debate the whole batched commands proposal.
I spent some time working on the batch commands problem this weekend, and might have something to show soon. It’s kind of a different idea, will see what you think of the PR 😃