Common database operations are slow
Is your feature request related to a problem? Please describe.
Currently, Ghidra uses its own custom database, which supports only rudimentary indexing and filtering, so almost all filtering and sorting has to be done at a higher level by iterating over all candidate records (see TreeTextFilter). This results in a very slow user experience for common actions, like searching for a symbol (related issue: #500).
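To illustrate the pattern, here is a simplified sketch of client-side filtering of this kind (this is not Ghidra's actual TreeTextFilter code; the class and method names are made up for illustration):

import java.util.ArrayList;
import java.util.List;

// Simplified sketch of client-side filtering: every record is materialized
// and string-matched in application code, so the cost is O(total records)
// per search regardless of how few records actually match.
class LinearScanFilter {
    static List<String> filterContains(Iterable<String> allSymbolNames, String term) {
        List<String> matches = new ArrayList<>();
        String needle = term.toLowerCase();
        for (String name : allSymbolNames) {           // touches all ~1M records
            if (name.toLowerCase().contains(needle)) { // match logic lives in Java,
                matches.add(name);                     // not in the database
            }
        }
        return matches;
    }
}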
I measured the time to filter symbols (in the “Symbol Tree” using the “Contains” mode) by a given string in a large project (roughly a million symbols) to be around 15 s (computer specs: Ryzen 3700X, 32 GB RAM, mid-range NVMe SSD). I then exported all symbols into a CSV file using Ghidra’s Jython interpreter:
outf = open("CSV_PATH", "w")
for s in currentProgram.getSymbolTable().getAllSymbols(False):
outf.write(str(s.getAddress().getOffset()) + ',"' + s.getName() + '"\n')
outf.close()
and imported the CSV into three SQL databases (H2, PostgreSQL and SQLite):
-- For H2:
CREATE TABLE symbols (address bigint, name text)
AS SELECT * FROM csvread('CSV_PATH');

-- For PostgreSQL:
CREATE TABLE symbols (address bigint, name text);
COPY symbols FROM 'CSV_PATH' WITH (FORMAT CSV);

-- For SQLite (the last two lines are sqlite3 shell dot-commands, not SQL):
CREATE TABLE symbols (address bigint, name text);
.mode csv
.import CSV_PATH symbols
A subsequent SELECT * FROM symbols WHERE name ILIKE '%SEARCH_TERM%' (or its equivalent; SQLite, for example, has no ILIKE, but its LIKE is case-insensitive for ASCII by default) executed in ~1000 ms on H2, ~250 ms on PostgreSQL and ~100 ms on SQLite.
This isn’t meant as a comparison of the databases, but to show that, even without adding full-text indexes, common SQL databases can be at least 15x faster than Ghidra’s current code. I’m not very familiar with Ghidra’s internals, so please point out any problems that might invalidate these results.
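For anyone who wants to reproduce the query timing, a minimal JDBC harness along the following lines should work (the class name is made up, the JDBC URL and search term are placeholders, the appropriate driver must be on the classpath, and lower(...) LIKE lower(...) is used because ILIKE is not portable across all three databases):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

// Times the substring search against any of the three databases;
// pass a JDBC URL such as "jdbc:postgresql://localhost/test".
public class QueryTiming {
    public static void main(String[] args) throws Exception {
        String url = args[0];   // JDBC URL of the database holding the symbols table
        String term = args[1];  // the SEARCH_TERM from the query above
        try (Connection conn = DriverManager.getConnection(url);
             PreparedStatement ps = conn.prepareStatement(
                     "SELECT * FROM symbols WHERE lower(name) LIKE lower(?)")) {
            ps.setString(1, "%" + term + "%");
            long start = System.nanoTime();
            int rows = 0;
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) rows++;  // force full result materialization
            }
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            System.out.println(rows + " rows in " + elapsedMs + " ms");
        }
    }
}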
Describe the solution you’d like
I’d like to propose moving the filtering and sorting functionality into the database layer by adopting an existing database (a relational database seems most suitable), abstracting it under the same API used today, and adding “DB-aware” code to the hotspots that are currently slow. This would have the added benefit of relying on a well-maintained open-source project instead of custom code, making maintenance easier.
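As a rough sketch of what “abstracting it under the same API” might look like, the following pushes the predicate into SQL instead of evaluating it record by record in Java (all names here are hypothetical, not existing Ghidra interfaces):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical DB-aware symbol lookup: the filter runs inside the database,
// where it can use indexes, instead of in a Java loop over all rows.
class SqlSymbolStore {
    private final Connection conn;

    SqlSymbolStore(Connection conn) { this.conn = conn; }

    /** Returns addresses of symbols whose name contains the given term. */
    List<Long> findAddressesByNameSubstring(String term) throws SQLException {
        String sql = "SELECT address FROM symbols WHERE lower(name) LIKE lower(?)";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setString(1, "%" + term + "%");
            try (ResultSet rs = ps.executeQuery()) {
                List<Long> result = new ArrayList<>();
                while (rs.next()) result.add(rs.getLong("address"));
                return result;
            }
        }
    }
}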
Describe alternatives you’ve considered
It is certainly possible to optimise Ghidra’s current database to get performance comparable to existing, faster databases. I feel, however, that this would take more effort than integrating an existing database, both in up-front work and, especially, in ongoing maintenance.
It’s also possible that the current performance is caused by bugs that can be fixed without such architectural changes, which would definitely be preferable.
Issue Analytics
- State: closed
- Created: 4 years ago
- Comments: 6
I’m pretty sure the custom database is not the issue here. I just did a test in which I retrieved and filtered 1M records, and it took 533 ms, which is in the ballpark of the numbers you mentioned.

I suspect the problem has more to do with the way we map the records to Symbol objects. We may be doing it inefficiently: we keep only the symbol record keys in memory, so every time we access a symbol field (such as its name) while filtering and sorting, we may have to retrieve the symbol record again and again. There is a soft cache for the symbol objects, but if you are low on memory and the garbage collector runs, those objects will be reclaimed, forcing the code to go back to the database constantly to retrieve the information. Anyway, this is something we can look into to see what is really causing the slowness.
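To make the soft-cache hypothesis concrete, here is a minimal sketch of the failure mode being described (hypothetical names; Ghidra’s actual cache is more elaborate): values reachable only through a SoftReference can be reclaimed under memory pressure, so each subsequent access turns into a fresh database read.

import java.lang.ref.SoftReference;
import java.util.HashMap;
import java.util.Map;
import java.util.function.LongFunction;

// Minimal soft-value cache: entries survive only as long as the GC leaves
// them alone. Under memory pressure every get() can miss, so a filter/sort
// pass over 1M symbols may re-read each record from the database.
class SoftValueCache<V> {
    private final Map<Long, SoftReference<V>> map = new HashMap<>();
    private final LongFunction<V> loader; // e.g. reads the symbol record from the DB

    SoftValueCache(LongFunction<V> loader) { this.loader = loader; }

    V get(long key) {
        SoftReference<V> ref = map.get(key);
        V value = (ref != null) ? ref.get() : null;
        if (value == null) {               // reclaimed or never cached:
            value = loader.apply(key);     // hit the database again
            map.put(key, new SoftReference<>(value));
        }
        return value;
    }
}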
Since this doesn’t appear to be a DB issue, I am going to close the ticket. Speedups pertaining to the Symbol Tree/Table can be tracked in #500.