
Python ResultsReader class is incredibly slow

See original GitHub issue

It’s possible that I don’t fully understand everything the ResultsReader class is doing, so take this issue with a grain of salt.

When using the ResultsReader class to get results from a Splunk search, as indicated here: https://github.com/splunk/splunk-sdk-python/blob/master/splunklib/results.py#L173-L181, it takes an incredibly long time to get all the results. Using the jobs.results function is orders of magnitude faster.

For example, on a search with 175k results, it takes 4+ minutes to get the results with ResultsReader objects, and 3.7 seconds with the results function. The following snippet shows what I’m talking about:

import splunklib.results as results
import splunklib.client as client
from datetime import datetime
import json


splunk_object = client.connect(
    host="host",
    port="port",
    username="username",
    password="password",
    app="app",
    verify=True,
    autologin=True)

spl = '| makeresults count=175000'

splunk_search_kwargs = {"exec_mode": "blocking",
                        "earliest_time": "-48h",
                        "latest_time": "now",
                        "enable_lookups": "true"}

splunk_search_job = splunk_object.jobs.create(spl, **splunk_search_kwargs)


start_time_json = datetime.now()
# Get the results from the Splunk search
search_results_json = []
# log_general.debug("Getting Splunk search results.")
get_offset = 0
max_get = 49000
result_count = int(splunk_search_job['resultCount'])
while (get_offset < result_count):
    r = splunk_search_job.results(**{"count": max_get, "offset": get_offset, "output_mode": "json"})
    obj = json.loads(r.read())
    search_results_json.extend(obj['results'])
    get_offset += max_get
# log_general.debug("Found %d results" % len(search_results))

end_time_json = datetime.now()


start_time = datetime.now()
# Get the results from the Splunk search
search_results = []
# log_general.debug("Getting Splunk search results.")
get_offset = 0
max_get = 49000
result_count = int(splunk_search_job['resultCount'])
while (get_offset < result_count):
    rr = results.ResultsReader(splunk_search_job.results(**{"count": max_get, "offset": get_offset}))
    for result in rr:
        if isinstance(result, results.Message):
            # Diagnostic messages may be returned in the results
            print('%s: %s' % (result.type, result.message))
        elif isinstance(result, dict):
            # Normal events are returned as dicts
            search_results.append(result)
    get_offset += max_get
# log_general.debug("Found %d results" % len(search_results))

end_time = datetime.now()

print ("ResultsReader time: %s" % (end_time-start_time).seconds)
print ("json_results time: %s" % (end_time_json-start_time_json).seconds)

Is ResultsReader doing anything special that I miss out on by just getting the results in JSON mode directly? I know that ResultsReader uses XML under the hood, but that doesn’t really matter to me; at the end of the day, I just need the results in a Python object.

Issue Analytics

  • State: closed
  • Created: 5 years ago
  • Reactions: 7
  • Comments: 10 (3 by maintainers)

Top GitHub Comments

5 reactions
mew1033 commented, Oct 29, 2021

My solution is to just not use ResultsReader at all. Just use the raw .results method and grab it as json. Something like this:

import json

spl = 'your search goes here'

splunk_search_kwargs = {"exec_mode": "blocking",
                        "earliest_time": "-48h",
                        "latest_time": "now",
                        "enable_lookups": "true"}

splunk_search_job = splunk_object.jobs.create(spl, **splunk_search_kwargs)
search_results_json = []
get_offset = 0
max_get = 49000
result_count = int(splunk_search_job['resultCount'])
while (get_offset < result_count):
    r = splunk_search_job.results(**{"count": max_get, "offset": get_offset, "output_mode": "json"})
    obj = json.loads(r.read())
    search_results_json.extend(obj['results'])
    get_offset += max_get

4 reactions
bretlowery commented, Aug 14, 2019

Workarounds:

  1. Use job.export instead of job.create or job.oneshot
  2. Use a custom streaming wrapper in conjunction with io.BufferedReader

Using both of these improved my performance 100x+. Details HERE. A rough sketch of that approach follows below.
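
For reference, a minimal sketch of what those two workarounds might look like together. The connection details are placeholders, and this is an illustration rather than the exact code behind the 100x figure; it assumes a reasonably recent splunklib where the stream returned by jobs.export can be wrapped in io.BufferedReader, and that the JSON export stream is newline-delimited objects carrying a "result" key.

import io
import json

import splunklib.client as client

# Placeholder connection details -- substitute your own.
service = client.connect(
    host="host",
    port=8089,
    username="username",
    password="password")

spl = '| makeresults count=175000'

# jobs.export streams results back as the search runs, so there is no job
# to poll afterwards and no per-call pagination to manage.
export_stream = service.jobs.export(
    spl,
    output_mode="json",
    earliest_time="-48h",
    latest_time="now")

# Wrap the raw response in a BufferedReader so data comes off the socket
# in large chunks rather than many small reads.
reader = io.BufferedReader(export_stream, buffer_size=1024 * 1024)

search_results = []
for line in reader:
    line = line.strip()
    if not line:
        continue
    # Each line of the JSON export stream is assumed to be a standalone
    # object such as {"preview": false, "result": {...}}.
    obj = json.loads(line)
    if "result" in obj and not obj.get("preview", False):
        search_results.append(obj["result"])

print("Collected %d results" % len(search_results))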

IMHO the architectural design as described in the link is terrible. Splunk seems to constantly want to protect its users from themselves, which is OK in a GUI but terrible in an API, where engineers presumably know what they are doing.


Top Results From Across the Web

Solved: Python SDK - results.ResultsReader extremely slow
Solved: I'm writing a search using the example from the SDK below. My search matches around 220000 results and the search finishes in...

Splunk Python SDK API job.results limited to ... - Stack Overflow
I was able to get this working successfully. My code below should demonstrate how this is accomplished. import io import csv from time ...

Splunk Python SDK API job.results limited to 50k ... - Reddit
I have a job who's job['resultCount'] is 367k, but no matter what I do, I can't seem to pull more than the first...

splunklib.results — Splunk SDK for Python API Reference
This class returns dictionaries and Splunk messages from an XML results stream. ResultsReader is iterable, and returns a dict for results, or a...

guest/paul_allen/dev/p4-splunk/bin/splunklib/client.py - Swarm
Service); :class:`Service` objects have fields for the various Splunk ... You can also access the fields as though they were the fields of...
