Heavy disk writes in long runs
I set up a local instance (iBot) of Smokey in a Docker container, and recently I noticed an unreasonably high disk write volume (read from the Block I/O column of `docker stats`). At present, the container, which has been up for 7 days, has accrued a total block write volume of more than 150 GiB, which works out to roughly 20 GiB per day.
Upon closer inspection, I noticed that the file `bodyfetcherQueueTimings.p` was rewritten once every few seconds, and at the time it had a size of 2.1 MiB, growing by a few bytes with every update. Given that it's written once for every API call that fetches posts from SE, and that we're using around 10k of our daily quota, it's most likely the culprit for such a huge write volume.
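(As a rough sanity check, using only the figures above: rewriting a ~2.1 MiB file once per API call, at ~10,000 calls per day, comes to about 2.1 MiB × 10,000 ≈ 20.5 GiB of writes per day, which lines up with the ~20 GiB/day observed in `docker stats`.)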
As far as I know, that pickle file isn't used by any of Smokey's primary functionality (it exists for data analysis by humans), yet this file alone wears out SSDs very quickly (a single Samsung 970 EVO 500 GB model in less than 3 years if left unattended, calculating from raw bytes written). Should we take some measures to reduce this unnecessary hardware workload?
Issue Analytics
- State: closed
- Created 4 years ago
- Comments: 8 (7 by maintainers)
Top GitHub Comments
I have two proposals for how we can go about fixing this issue:
Disable queue timing logging by default. This data isn't necessary for Smokey to function correctly, and isn't accessed during normal operation. We only use this data on an infrequent basis; thus, it would make sense to disable this logging by default, and only enable it on the rare occasions that we actually need to collect the data (e.g. by passing some form of option to the startup command, like we have for standby, etc.).
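A minimal sketch of what such an opt-in flag could look like, assuming an argparse-style startup script; the flag name, the stub function, and the wiring are illustrative assumptions, not Smokey's actual interface:

```python
# Hypothetical sketch only: the flag name and the stub below are not
# Smokey's real startup interface, just an illustration of the opt-in idea.
import argparse


def dump_queue_timings():
    """Placeholder for the existing call that writes bodyfetcherQueueTimings.p."""


parser = argparse.ArgumentParser()
parser.add_argument(
    "--log-queue-timings",
    action="store_true",
    help="persist bodyfetcher queue timings to disk (off by default)",
)
args = parser.parse_args()

# The existing dump call would then be guarded by the flag:
if args.log_queue_timings:
    dump_queue_timings()
```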
Change the data to a format which can easily be appended to the end of a file, so that updates don't require an overwrite. Pickle, as a serialised format, is difficult to modify without loading the binary data into Python objects, changing them, and re-writing the data to file. That downside is worth it if you're dealing with complex Python objects that are hard to store by other means; however, we're only using it to store relatively simple data (a series of lists of integers (or floats?), with each number representing the time a post waited for an API request).

(Side note: wouldn't our current implementation benefit from the use of `collections.defaultdict`?)

Given that all the analysis is done by a totally separate script, I don't think we're benefiting greatly from pickle's ability to store Python objects directly. Instead, I propose that we take a simpler approach and convert the data to one timing per line, in the format `site timing` (e.g. `stackoverflow.com 231`). Then, at a later time, the data analysis script can aggregate the data into a dictionary, or whatever other format is most convenient for analysis. This method has the advantage that new data can easily be appended to the end of the file without having to overwrite it, thus saving disk writes.

Better ideas for a new data format are welcomed.
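For concreteness, here is a minimal sketch of the proposed append-only format and the later aggregation step (which is also where `collections.defaultdict` would help); the file name and function names are hypothetical, not Smokey's actual code:

```python
# Minimal sketch of the proposed "site timing" line format; the file name
# and function names are hypothetical, not Smokey's actual code.
from collections import defaultdict

TIMINGS_FILE = "bodyfetcherQueueTimings.txt"  # hypothetical replacement for the .p pickle


def record_timing(site: str, timing: float) -> None:
    """Append one 'site timing' line instead of rewriting the whole file."""
    with open(TIMINGS_FILE, "a") as f:
        f.write(f"{site} {timing}\n")


def load_timings() -> dict:
    """Run later by the analysis script: aggregate lines into site -> [timings]."""
    timings = defaultdict(list)
    with open(TIMINGS_FILE) as f:
        for line in f:
            site, value = line.rsplit(" ", 1)
            timings[site].append(float(value))
    return timings


# e.g. record_timing("stackoverflow.com", 231)
```

The key difference is that each update appends only a few bytes, instead of rewriting the whole (and ever-growing) file on every API call.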
This issue has been closed because it has had no recent activity. If this is still important, please add another comment and find someone with write permissions to reopen the issue. Thank you for your contributions.