Performance Optimizations
- Faker version: master branch
- OS: macOS
I am a Performance Engineer at Salesforce.org and I use Faker within my tool Snowfakery to generate hundreds of millions of rows of data. Some teams within my company have avoided Faker because they say it is too slow.
I’m a huge fan of Faker, but despite everything else Snowfakery does (including writing to a SQL database), Faker is often the bottleneck. Luckily, this seems easy to fix: I investigated and found some quick performance wins.
If the person creating the Faker() object had more influence on the random_element method, they could get a gigantic speedup. I understand that some users need the weighted-distribution behaviour that random_element provides, but my users do not, so I wish I could turn it off.
Steps to reproduce
from unittest.mock import patch
import random
from collections import OrderedDict
import faker
import timeit


def bench():
    # Time 100,000 first_name() calls on a fresh Faker instance.
    f = faker.Faker()
    print(timeit.timeit(lambda: f.first_name(), number=100000))


def fast_random_element(self, choices):
    # Simple optimization: ignore weights and pick uniformly.
    if isinstance(choices, OrderedDict):
        return random.choice(tuple(choices.keys()))
    else:
        return random.choice(choices)


def fast_random_element_2(self, choices):
    # Same as above, but cache the flattened key tuple on the OrderedDict
    # so it only has to be built once per choices collection.
    if isinstance(choices, OrderedDict):
        if not hasattr(choices, "_cached_choice_list"):
            setattr(choices, "_cached_choice_list", tuple(choices.keys()))
        choices = choices._cached_choice_list
    return random.choice(choices)


def all_bench():
    print("Warmup")
    bench()
    print("Normal")
    bench()
    with patch("faker.providers.BaseProvider.random_element", fast_random_element):
        print("Simple optimization - No caching")
        bench()
    with patch("faker.providers.BaseProvider.random_element", fast_random_element_2):
        print("With caching")
        bench()
    print("Baseline again")
    bench()


all_bench()
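Note that both patched runs replace BaseProvider.random_element for every provider, so the speedup applies to essentially any generated field that goes through random_element, not just first_name().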
Expected behaviour
Faker should be roughly as fast for simple element choices as calling random.choice in plain Python.
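For concreteness, "as fast as plain Python" means something like the following raw random.choice baseline over the same underlying name data. This is a minimal sketch that assumes the en_US person provider exposes its name data as a first_names attribute (true on recent versions, but treat it as an assumption):

import random
import timeit

from faker.providers.person.en_US import Provider as PersonProvider

# Raw-Python baseline for comparison with the Faker timings below.
# Iterating first_names yields the names whether it is a mapping of
# name -> weight or a plain sequence.
names = tuple(PersonProvider.first_names)
print(timeit.timeit(lambda: random.choice(names), number=100000))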
Actual behavior
These times are in seconds:
- Faker (baseline): 6.72312176
- Simple optimization (no caching): 1.888246217999999
- With caching: 0.3221195019999996
I am willing to submit a PR for this but I might need some guidance about where you would want to store the “weighted or fast” flag. Perhaps pass it from the Faker() constructor to the Provider objects?
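For illustration, here is a rough, hypothetical sketch of where such a flag could live. SketchProvider and use_weighting are made-up names, not existing Faker API, and the weighted branch approximates Faker's current behaviour with random.choices rather than its internal helpers:

import random
from collections import OrderedDict


class SketchProvider:
    """Hypothetical provider showing where a 'weighted or fast' flag could live."""

    def __init__(self, use_weighting=True):
        # The flag would be passed down from the Faker() constructor.
        self.use_weighting = use_weighting

    def random_element(self, elements=("a", "b", "c")):
        if isinstance(elements, OrderedDict):
            if self.use_weighting:
                # Weighted path, roughly what Faker does today.
                return random.choices(
                    tuple(elements.keys()),
                    weights=tuple(elements.values()),
                    k=1,
                )[0]
            # Fast path: the caller opted out of weighting, so drop the weights.
            elements = tuple(elements.keys())
        return random.choice(elements)

With something like this, a constructor argument such as Faker(use_weighting=False) could wire the fast path through to every provider while keeping weighted selection as the default.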
Top GitHub Comments
@prescod That sounds too magical. I’d rather keep it simple and shift the responsibility to the user.
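If the responsibility stays with the user, one way to take it today is the approach from the benchmark above, applied permanently at application startup rather than inside a context manager. A sketch, assuming the user is happy to drop weighting everywhere:

import random
from collections import OrderedDict

from faker.providers import BaseProvider


def unweighted_random_element(self, elements=("a", "b", "c")):
    # Ignore any weights and pick uniformly, as in fast_random_element above.
    if isinstance(elements, OrderedDict):
        elements = tuple(elements.keys())
    return random.choice(elements)


# Replace the method globally for all providers.
BaseProvider.random_element = unweighted_random_element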
I just ran “pytest” without tox.