Replace each nested list comprehension with a single DB query in BIDSLayout.__repr__
I think there are currently some serious performance issues with BIDSLayout. Using a fairly average dataset of 132 subjects (1 session and 1 run per subject), it takes about 1 minute 15 seconds to construct a layout object. Using the following code:
from bids import BIDSLayout
%lprun -f BIDSLayout.__init__ BIDSLayout("/media/christian/ElementsSE/MPI-Leipzig_Mind-Brain-Body-LEMON/BIDS_LEMON/")
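(For reference, the same measurement can be reproduced outside IPython with line_profiler's Python API; this is only a sketch, with the dataset path as a placeholder:)

```python
from bids import BIDSLayout
from line_profiler import LineProfiler

profiler = LineProfiler()
profiler.add_function(BIDSLayout.__init__)  # trace the constructor line by line
layout = profiler.runcall(BIDSLayout, "/path/to/BIDS_LEMON")  # placeholder path
profiler.print_stats()  # prints the same kind of report as %lprun
```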
I get the following profiling report:
Total time: 76.4714 s
File: /home/christian/pybids/bids/layout/layout.py
Function: __init__ at line 196
Line # Hits Time Per Hit % Time Line Contents
==============================================================
196 def __init__(self, root, validate=True, absolute_paths=True,
197 derivatives=False, config=None, sources=None, ignore=None,
198 force_index=None, config_filename='layout_config.json',
199 regex_search=False, database_path=None, database_file=None,
200 reset_database=False, index_metadata=True):
201 """Initialize BIDSLayout."""
202 1 4.0 4.0 0.0 self.root = str(root)
203 1 2.0 2.0 0.0 self.validate = validate
204 1 1.0 1.0 0.0 self.absolute_paths = absolute_paths
205 1 2.0 2.0 0.0 self.derivatives = {}
206 1 2.0 2.0 0.0 self.sources = sources
207 1 3.0 3.0 0.0 self.regex_search = regex_search
208 1 2.0 2.0 0.0 self.config_filename = config_filename
209 # Store original init arguments as dictionary
210 1 3.0 3.0 0.0 self._init_args = self._sanitize_init_args(
211 1 2.0 2.0 0.0 root=root, validate=validate, absolute_paths=absolute_paths,
212 1 2.0 2.0 0.0 derivatives=derivatives, ignore=ignore, force_index=force_index,
213 1 91.0 91.0 0.0 index_metadata=index_metadata, config=config)
214
215 1 4.0 4.0 0.0 if database_path is None and database_file is not None:
216 database_path = database_file
217 warnings.warn(
218 'In pybids 0.10 database_file argument was deprecated in favor'
219 ' of database_path, and will be removed in 0.12. '
220 'For now, treating database_file as a directory.',
221 DeprecationWarning)
222 1 4.0 4.0 0.0 if database_path:
223 database_path = str(Path(database_path).absolute())
224
225 1 47.0 47.0 0.0 self.session = None
226
227 1 25891.0 25891.0 0.0 index_dataset = self._init_db(database_path, reset_database)
228
229 # Do basic BIDS validation on root directory
230 1 488.0 488.0 0.0 self._validate_root()
231
232 1 4.0 4.0 0.0 if ignore is None:
233 1 3.0 3.0 0.0 ignore = self._default_ignore
234
235 # Instantiate after root validation to ensure os.path.join works
236 1 3.0 3.0 0.0 self.ignore = [os.path.abspath(os.path.join(self.root, patt))
237 if isinstance(patt, str) else patt
238 1 102.0 102.0 0.0 for patt in listify(ignore or [])]
239 1 3.0 3.0 0.0 self.force_index = [os.path.abspath(os.path.join(self.root, patt))
240 if isinstance(patt, str) else patt
241 1 4.0 4.0 0.0 for patt in listify(force_index or [])]
242
243 # Initialize the BIDS validator and examine ignore/force_index args
244 1 4.0 4.0 0.0 self._validate_force_index()
245
246 1 1.0 1.0 0.0 if index_dataset:
247 # Create Config objects
248 1 2.0 2.0 0.0 if config is None:
249 1 2.0 2.0 0.0 config = 'bids'
250 1 1.0 1.0 0.0 config = [Config.load(c, session=self.session)
251 1 62271.0 62271.0 0.1 for c in listify(config)]
252 1 15.0 15.0 0.0 self.config = {c.name: c for c in config}
253 # Missing persistence of configs to the database
254 2 6.0 3.0 0.0 for config_obj in self.config.values():
255 1 308.0 308.0 0.0 self.session.add(config_obj)
256 1 27372.0 27372.0 0.0 self.session.commit()
257
258 # Index files and (optionally) metadata
259 1 35.0 35.0 0.0 indexer = BIDSLayoutIndexer(self)
260 1 28127988.0 28127988.0 36.8 indexer.index_files()
261 1 3.0 3.0 0.0 if index_metadata:
262 1 48226769.0 48226769.0 63.1 indexer.index_metadata()
263 else:
264 # Load Configs from DB
265 self.config = {c.name: c for c in self.session.query(Config).all()}
266
267 # Add derivatives if any are found
268 1 3.0 3.0 0.0 if derivatives:
269 if derivatives is True:
270 derivatives = os.path.join(root, 'derivatives')
271 self.add_derivatives(
272 derivatives, parent_database_path=database_path,
273 validate=validate, absolute_paths=absolute_paths,
274 derivatives=None, sources=self, ignore=ignore, config=None,
275 force_index=force_index, config_filename=config_filename,
276 regex_search=regex_search, index_metadata=index_metadata,
277 reset_database=index_dataset or reset_database
278 )
For day-to-day interaction with a dataset, development tests, etc., this kind of delay seems prohibitive to me…
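One mitigation that is already visible in the profiled signature is the database_path / reset_database pair: persisting the SQLite index to disk so that only the first instantiation pays the full indexing cost. A minimal sketch for the pybids version profiled here (paths are placeholders):

```python
from bids import BIDSLayout

dataset = "/path/to/BIDS_LEMON"      # placeholder paths
cache_dir = "/path/to/layout_cache"

# First call: walks and indexes the dataset (~1:15 here) and saves the SQLite index.
layout = BIDSLayout(dataset, database_path=cache_dir)

# Subsequent calls: reload the saved index instead of re-indexing the dataset.
layout = BIDSLayout(dataset, database_path=cache_dir)

# After the dataset changes, force a rebuild of the index.
layout = BIDSLayout(dataset, database_path=cache_dir, reset_database=True)
```

This does not make __init__ or __repr__ themselves faster, but it removes the repeated cost during day-to-day interactive work.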
Actually, looking at this more closely, it looks like __repr__ is not using SQL queries and is doing some slow nested list comprehensions, so this is probably the issue. There is a TODO comment to implement this. Especially calculating the number of sessions, looping over subjects, is pretty slow.
https://github.com/bids-standard/pybids/blob/35e1296202959d375e570d08078282c26ad02bc0/bids/layout/layout.py#L302
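For context, the pattern at the linked line is roughly len([ses for sub in self.get_subjects() for ses in self.get_sessions(subject=sub)]), i.e. one extra lookup per subject. Counting through the ORM directly would issue a single query per number reported by __repr__. A sketch of what that could look like, assuming the Tag model has entity_name and _value columns (the real column names may differ):

```python
from bids.layout.models import Tag  # assumed location of the Tag model

def count_distinct(layout, entity):
    """Count distinct values of one entity with a single DB query.

    Assumes Tag rows look like (file, entity_name, _value); the actual
    column names in pybids may differ.
    """
    return (layout.session.query(Tag._value)
            .filter(Tag.entity_name == entity)
            .distinct()
            .count())

# Inside __repr__, these would replace the per-subject list comprehensions:
# n_subjects = count_distinct(self, 'subject')
# n_runs = count_distinct(self, 'run')
```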
Ah yeah, you’re right. I’m still not getting how to use just the Tag model to count the distinct combinations of subject and session though.
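One way to get that from the Tag model alone is a self-join on the file key: pair each file's subject tag with its session tag and count the distinct value pairs. A sketch, again assuming Tag has file_path, entity_name and _value columns (names may not match the actual schema exactly):

```python
from sqlalchemy.orm import aliased
from bids.layout.models import Tag  # assumed location of the Tag model

sub_tag = aliased(Tag)
ses_tag = aliased(Tag)

# Distinct (subject, session) combinations in a single query:
# join each file's 'subject' tag to its 'session' tag, then count distinct pairs.
n_sessions = (
    layout.session.query(sub_tag._value, ses_tag._value)
    .select_from(sub_tag)
    .join(ses_tag, ses_tag.file_path == sub_tag.file_path)  # assumed file key column
    .filter(sub_tag.entity_name == 'subject',
            ses_tag.entity_name == 'session')
    .distinct()
    .count()
)
```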