
Python ResultsReader class is incredibly slow

See original GitHub issue

It’s possible that I don’t fully understand everything the ResultsReader class is doing, so take this issue with a grain of salt.

When using the ResultsReader class to get results from a Splunk search, as indicated here: https://github.com/splunk/splunk-sdk-python/blob/master/splunklib/results.py#L173-L181, it takes an incredibly long time to get all the results. Using the jobs.results function is orders of magnitude faster.

For example, on a search with 175k results, it takes 4+ minutes to get the results with ResultsReader objects, and 3.7 seconds with the results function. The following snippet shows what I’m talking about:

import splunklib.results as results
import splunklib.client as client
from datetime import datetime
import json


splunk_object = client.connect(
    host="host",
    port="port",
    username="username",
    password="password",
    app="app",
    verify=True,
    autologin=True)

spl = '| makeresults count=175000'

splunk_search_kwargs = {"exec_mode": "blocking",
                        "earliest_time": "-48h",
                        "latest_time": "now",
                        "enable_lookups": "true"}

splunk_search_job = splunk_object.jobs.create(spl, **splunk_search_kwargs)


start_time_json = datetime.now()
# Get the results from the Splunk search
search_results_json = []
# log_general.debug("Getting Splunk search results.")
get_offset = 0
max_get = 49000
result_count = int(splunk_search_job['resultCount'])
while (get_offset < result_count):
    r = splunk_search_job.results(**{"count": max_get, "offset": get_offset, "output_mode": "json"})
    obj = json.loads(r.read())
    search_results_json.extend(obj['results'])
    get_offset += max_get
# log_general.debug("Found %d results" % len(search_results))

end_time_json = datetime.now()


start_time = datetime.now()
# Get the results from the Splunk search
search_results = []
# log_general.debug("Getting Splunk search results.")
get_offset = 0
max_get = 49000
result_count = int(splunk_search_job['resultCount'])
while (get_offset < result_count):
    rr = results.ResultsReader(splunk_search_job.results(**{"count": max_get, "offset": get_offset}))
    for result in rr:
        if isinstance(result, results.Message):
            # Diagnostic messages may be returned in the results
            print('%s: %s' % (result.type, result.message))
        elif isinstance(result, dict):
            # Normal events are returned as dicts
            search_results.append(result)
    get_offset += max_get
# log_general.debug("Found %d results" % len(search_results))

end_time = datetime.now()

print ("ResultsReader time: %s" % (end_time-start_time).seconds)
print ("json_results time: %s" % (end_time_json-start_time_json).seconds)

Is ResultsReader doing anything special that I miss out on by just getting the results in JSON mode directly? I know that ResultsReader uses XML under the hood, but that doesn’t really matter to me; at the end of the day, I just need the results in a Python object.

Issue Analytics

  • State: closed
  • Created: 5 years ago
  • Reactions: 7
  • Comments: 10 (3 by maintainers)

Top GitHub Comments

5 reactions
mew1033 commented, Oct 29, 2021

My solution is to just not use ResultsReader at all. Just use the raw .results method and grab it as json. Something like this:

import json

spl = 'your search goes here'

splunk_search_kwargs = {"exec_mode": "blocking",
                        "earliest_time": "-48h",
                        "latest_time": "now",
                        "enable_lookups": "true"}

splunk_search_job = splunk_object.jobs.create(spl, **splunk_search_kwargs)
search_results_json = []
get_offset = 0
max_get = 49000
result_count = int(splunk_search_job['resultCount'])
while (get_offset < result_count):
    r = splunk_search_job.results(**{"count": max_get, "offset": get_offset, "output_mode": "json"})
    obj = json.loads(r.read())
    search_results_json.extend(obj['results'])
    get_offset += max_get

4 reactions
bretlowery commented, Aug 14, 2019

Workarounds:

  1. Use job.export instead of job.create or job.oneshot
  2. Use a custom streaming wrapper in conjunction with io.BufferedReader

Using both of these improved my performance 100x+. Details HERE. A rough sketch of that approach follows below.
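
For reference, a minimal sketch of what those two workarounds might look like together. The connection details are placeholders, and this is an illustration rather than the exact code behind the 100x figure; it assumes a reasonably recent splunklib where the stream returned by jobs.export can be wrapped in io.BufferedReader, and that the JSON export stream is newline-delimited objects carrying a "result" key.

import io
import json

import splunklib.client as client

# Placeholder connection details -- substitute your own.
service = client.connect(
    host="host",
    port=8089,
    username="username",
    password="password")

spl = '| makeresults count=175000'

# jobs.export streams results back as the search runs, so there is no job
# to poll afterwards and no per-call pagination to manage.
export_stream = service.jobs.export(
    spl,
    output_mode="json",
    earliest_time="-48h",
    latest_time="now")

# Wrap the raw response in a BufferedReader so data comes off the socket
# in large chunks rather than many small reads.
reader = io.BufferedReader(export_stream, buffer_size=1024 * 1024)

search_results = []
for line in reader:
    line = line.strip()
    if not line:
        continue
    # Each line of the JSON export stream is assumed to be a standalone
    # object such as {"preview": false, "result": {...}}.
    obj = json.loads(line)
    if "result" in obj and not obj.get("preview", False):
        search_results.append(obj["result"])

print("Collected %d results" % len(search_results))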

IMHO the architectural design as described in the link is terrible. Splunk seems to constantly want to protect its users from themselves, which is OK in a GUI but terrible in an API, where engineers presumably know what they are doing.


Top Results From Across the Web

Solved: Python SDK - results.ResultsReader extremely slow
Solved: I'm writing a search using the example from the SDK below. My search matches around 220000 results and the search finishes in...

Splunk Python SDK API job.results limited to ... - Stack Overflow
I was able to get this working successfully. My code below should demonstrate how this is accomplished. import io import csv from time ...

Splunk Python SDK API job.results limited to 50k ... - Reddit
I have a job who's job['resultCount'] is 367k, but no matter what I do, I can't seem to pull more than the first...

splunklib.results — Splunk SDK for Python API Reference
This class returns dictionaries and Splunk messages from an XML results stream. ResultsReader is iterable, and returns a dict for results, or a...

guest/paul_allen/dev/p4-splunk/bin/splunklib/client.py - Swarm
Service); :class:`Service` objects have fields for the various Splunk ... You can also access the fields as though they were the fields of...
