Python ResultsReader class is incredibly slow
It’s possible that I don’t understand everything the ResultsReader class is doing, so take this issue with a grain of salt.
When using the ResultsReader class to get results from a Splunk search, as indicated here: https://github.com/splunk/splunk-sdk-python/blob/master/splunklib/results.py#L173-L181, it takes an incredibly long time to get all the results. Using the jobs.results function is orders of magnitude faster. For example, on a search with 175k results, it takes 4+ minutes to get the results with ResultsReader objects, and 3.7 seconds with the results function. The following snippet shows what I’m talking about:
```python
import json
from datetime import datetime

import splunklib.client as client
import splunklib.results as results

splunk_object = client.connect(
    host="host",
    port="port",
    username="username",
    password="password",
    app="app",
    verify=True,
    autologin=True)

spl = '| makeresults count=175000'
splunk_search_kwargs = {"exec_mode": "blocking",
                        "earliest_time": "-48h",
                        "latest_time": "now",
                        "enable_lookups": "true"}
splunk_search_job = splunk_object.jobs.create(spl, **splunk_search_kwargs)

# Get the results from the Splunk search in JSON mode
start_time_json = datetime.now()
search_results_json = []
get_offset = 0
max_get = 49000
result_count = int(splunk_search_job['resultCount'])
while get_offset < result_count:
    r = splunk_search_job.results(**{"count": max_get, "offset": get_offset, "output_mode": "json"})
    obj = json.loads(r.read())
    search_results_json.extend(obj['results'])
    get_offset += max_get
end_time_json = datetime.now()

# Get the same results via ResultsReader
start_time = datetime.now()
search_results = []
get_offset = 0
while get_offset < result_count:
    rr = results.ResultsReader(splunk_search_job.results(**{"count": max_get, "offset": get_offset}))
    for result in rr:
        if isinstance(result, results.Message):
            # Diagnostic messages may be returned in the results
            print('%s: %s' % (result.type, result.message))
        elif isinstance(result, dict):
            # Normal events are returned as dicts
            search_results.append(result)
    get_offset += max_get
end_time = datetime.now()

print("ResultsReader time: %s" % (end_time - start_time).seconds)
print("json_results time: %s" % (end_time_json - start_time_json).seconds)
```
Is ResultsReader doing anything special that I miss out on by just getting the results in JSON mode directly? I know that ResultsReader uses XML under the hood, but that doesn’t really matter to me; at the end of the day, I just need the results in a Python object.
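To make the parsing difference concrete, here is a rough, self-contained sketch of the two payload shapes. The XML structure below is only an approximation of Splunk's results format (the payloads and helper names are mine, not the SDK's actual parser): every field of every event is wrapped in nested field/value/text elements, so there is far more markup to walk per event than in the equivalent JSON body.

```python
import json
import xml.etree.ElementTree as ET

# Approximation of one event in Splunk's XML results format.
xml_payload = """<results>
  <result offset="0">
    <field k="host"><value><text>web-01</text></value></field>
    <field k="status"><value><text>200</text></value></field>
  </result>
</results>"""

# The same event as a JSON results body.
json_payload = '{"results": [{"host": "web-01", "status": "200"}]}'

def parse_xml_results(payload):
    # Walk every <result>, then every <field>, then its <value>/<text> pair.
    events = []
    for result in ET.fromstring(payload).iter("result"):
        event = {}
        for field in result.iter("field"):
            text = field.find("./value/text")
            event[field.get("k")] = text.text if text is not None else None
        events.append(event)
    return events

def parse_json_results(payload):
    # One decode call yields the list of event dicts directly.
    return json.loads(payload)["results"]

# Both paths recover the same Python objects.
assert parse_xml_results(xml_payload) == parse_json_results(json_payload)
```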
Issue Analytics
- Created 5 years ago
- Reactions: 7
- Comments: 10 (3 by maintainers)
My solution is to just not use ResultsReader at all. Just use the raw .results method and grab the output as JSON.

Workarounds: using both of these improved my performance 100x+. Details HERE.
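A minimal sketch of that workaround, generalizing the JSON paging loop from the snippet above. The fetch_all_results helper and its parameters are my own naming, not part of splunklib; with splunklib the page fetcher would be something like lambda count, offset: json.loads(job.results(count=count, offset=offset, output_mode="json").read()).

```python
def fetch_all_results(fetch_page, result_count, page_size=49000):
    """Page through a finished job's results in JSON mode.

    fetch_page(count, offset) must return the decoded JSON body of one
    results call, i.e. a dict with a "results" list.
    """
    all_results = []
    offset = 0
    while offset < result_count:
        page = fetch_page(count=page_size, offset=offset)
        all_results.extend(page["results"])
        offset += page_size
    return all_results
```

Injecting the page fetcher keeps the paging logic testable without a live Splunk connection, and page_size stays below the server's default maximum results-per-request limit.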
IMHO the architectural design described in the link is terrible. Splunk seems to constantly want to protect its users from themselves, which is OK in a GUI but terrible in an API, where engineers presumably know what they are doing.