Large memory footprint in astropy.io.votable.parse_single_table

I was trying to read two columns from a ~1 GB VOTable file (demo Gaia DR2 data). The file itself contains ~96 columns. The code I used was:

from astropy.io.votable import parse_single_table
columns = ['phot_g_mean_mag', 'parallax']
table = parse_single_table("async_20190630210155.vot", columns=columns)
print("Done reading table")

Here’s the file info:

$ ls -alh async_20190630210155.vot 
-rw-rw-rw- 1 msinha 1195219923 1.1G Jul  1 14:01 async_20190630210155.vot

Looking at the memory footprint, I saw that Python was taking ~12 GB during the read, so I cancelled the kernel (this was within a Jupyter notebook). [Screenshot of the memory usage: Screen Shot 2019-07-02 at 10 13 50 am]
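
For anyone who wants a number rather than a screenshot, here is a minimal sketch of recording the peak resident memory of the process around a call. It uses only the standard-library `resource` module, so it is Unix-only. The helper names `peak_rss_mib` and `run_and_report` are made up for this illustration and are not part of astropy.

```python
import resource
import sys


def peak_rss_mib():
    """Peak resident set size of this process so far, in MiB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


def run_and_report(func, *args, **kwargs):
    """Call func and return (result, peak before, peak after) in MiB.

    Hypothetical helper: the peak is a high-water mark, so `after - before`
    only shows how far the call pushed the peak, not what it still holds.
    """
    before = peak_rss_mib()
    result = func(*args, **kwargs)
    after = peak_rss_mib()
    return result, before, after
```

With astropy installed, the read from the report could be wrapped as `run_and_report(parse_single_table, "async_20190630210155.vot", columns=columns)`.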

While I know that there is significant Python overhead, it still seems like a lot of memory to read only 2 columns (out of 96). By my math, the minimum possible size is 2/96 × 1 GB ≈ 0.02 GB.
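
The back-of-the-envelope estimate can be spelled out as below. This is only a rough sketch: it assumes all 96 columns take up equal space in the file, uses the 1.1 GB size from the `ls` output rather than the rounded 1 GB, and ignores the difference between XML text on disk and arrays in memory.

```python
file_size_gb = 1.1       # size reported by `ls -alh`
n_columns_total = 96     # approximate column count from the report
n_columns_wanted = 2     # 'phot_g_mean_mag' and 'parallax'
observed_gb = 12         # approximate footprint seen before cancelling

# Assumption: every column contributes an equal share of the file size.
fraction = n_columns_wanted / n_columns_total
lower_bound_gb = fraction * file_size_gb

print(f"fraction of columns: {fraction:.4f}")
print(f"naive lower bound:   {lower_bound_gb:.3f} GB")
print(f"observed / bound:    {observed_gb / lower_bound_gb:.0f}x")
```

Either way the bound comes out at roughly 0.02 GB, a few hundred times smaller than the observed footprint.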

Since I am new to both astropy and VOTables, perhaps I am doing something incorrectly. Happy to provide further info or help debug, as necessary. In case there is something inherently wrong with the file itself, here’s a Dropbox link to the file.

Cheers, Manodeep

Issue Analytics

  • State: open
  • Created 4 years ago
  • Comments: 19 (19 by maintainers)

Top GitHub Comments

manodeep commented, Mar 13, 2020 (2 reactions)

Probably wouldn’t be the case in the immigrant library 😉

pllim commented, Mar 18, 2020 (1 reaction)

I tried memory profiling by extracting the code that I thought was relevant into its own file for the profiler to crawl through. Rename this to run_profiler.py: run_profiler.py.txt

Then I ran the command python -m memory_profiler run_profiler.py using memory-profiler 0.55.0. It took a good few hours! Here are the results that I got but I’ll have to come back and contemplate this later:
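
For readers who have not used memory-profiler, this is roughly how a script for `python -m memory_profiler` tends to be laid out. It is not the attached run_profiler.py; the two functions are made-up stand-ins for the extracted astropy code. When the script runs under `python -m memory_profiler`, the `profile` decorator is injected into builtins, and the fallback lets the same file run as plain Python too.

```python
# Hypothetical layout; NOT the attached run_profiler.py.
try:
    profile  # injected into builtins by `python -m memory_profiler`
except NameError:
    def profile(func):
        """No-op stand-in so the script also runs without the profiler."""
        return func


@profile
def build_rows(n_rows, n_cols):
    """Stand-in workload: allocate every column of every row."""
    return [[float(i)] * n_cols for i in range(n_rows)]


@profile
def keep_columns(rows, indices):
    """Stand-in for column selection done after everything is in memory."""
    return [[row[i] for i in indices] for row in rows]


if __name__ == "__main__":
    rows = build_rows(1000, 96)
    subset = keep_columns(rows, [0, 1])
    print(len(subset), len(subset[0]))
```

Each decorated function gets its own line-by-line table in the profiler report, which is why the results come as separate listings.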

VOTableFile.parse()
Line #    Mem usage    Increment   Line Contents
================================================
   164     69.4 MiB     69.4 MiB       @profile
   165                                 def parse(self, iterator, config):
   166     69.4 MiB      0.0 MiB           config['_current_table_number'] = 0
   167                             
   168     69.5 MiB      0.2 MiB           for start, tag, data, pos in iterator:
   169     69.5 MiB      0.0 MiB               if start:
   170     69.5 MiB      0.0 MiB                   if tag == 'xml':
   171     69.5 MiB      0.0 MiB                       pass
   172     69.5 MiB      0.0 MiB                   elif tag == 'VOTABLE':
   173     69.5 MiB      0.0 MiB                       if 'version' not in data:
   174                                                     warn_or_raise(W20, W20, self.version, config, pos)
   175                                                     config['version'] = self.version
   176                                                 else:
   177     69.5 MiB      0.0 MiB                           config['version'] = self._version = data['version']
   178     69.5 MiB      0.0 MiB                           if config['version'].lower().startswith('v'):
   179                                                         warn_or_raise(
   180                                                             W29, W29, config['version'], config, pos)
   181                                                         self._version = config['version'] = config['version'][1:]  # noqa
   182     69.5 MiB      0.0 MiB                           if config['version'] not in ('1.1', '1.2', '1.3',
   183                                                                                  '1.4'):
   184                                                         vo_warn(W21, config['version'], config, pos)
   185                             
   186     69.5 MiB      0.0 MiB                       if 'xmlns' in data:
   187                                                     # Starting with VOTable 1.3, namespace URIs stop
   188                                                     # incrementing with minor version changes.  See
   189                                                     # this IVOA note for more info:
   190                                                     # http://www.ivoa.net/documents/Notes/XMLVers/20180529/
   191                                                     #
   192                                                     # If this policy is in place for major version 2,
   193                                                     # then this logic will need tweaking.
   194     69.5 MiB      0.0 MiB                           if config['version'] in ('1.3', '1.4'):
   195     69.5 MiB      0.0 MiB                               ns_version = '1.3'
   196                                                     else:
   197                                                         ns_version = config['version']
   198                                                     correct_ns = (
   199     69.5 MiB      0.0 MiB                               'http://www.ivoa.net/xml/VOTable/v{}'.format(
   200     69.5 MiB      0.0 MiB                                   ns_version))
   201     69.5 MiB      0.0 MiB                           if data['xmlns'] != correct_ns:
   202                                                         vo_warn(
   203                                                             W41, (correct_ns, data['xmlns']), config, pos)
   204                                                 else:
   205                                                     vo_warn(W42, (), config, pos)
   206                             
   207     69.5 MiB      0.0 MiB                       break
   208                                             else:
   209                                                 vo_raise(E19, (), config, pos)
   210                                     config['version_1_1_or_later'] = \
   211     69.5 MiB      0.0 MiB               util.version_compare(config['version'], '1.1') >= 0
   212                                     config['version_1_2_or_later'] = \
   213     69.5 MiB      0.0 MiB               util.version_compare(config['version'], '1.2') >= 0
   214                                     config['version_1_3_or_later'] = \
   215     69.5 MiB      0.0 MiB               util.version_compare(config['version'], '1.3') >= 0
   216                                     config['version_1_4_or_later'] = \
   217     69.5 MiB      0.0 MiB               util.version_compare(config['version'], '1.4') >= 0
   218                             
   219                                     tag_mapping = {
   220     69.5 MiB      0.0 MiB               'PARAM': self._add_param,
   221     69.5 MiB      0.0 MiB               'RESOURCE': self._add_resource,
   222     69.5 MiB      0.0 MiB               'COOSYS': self._add_coosys,
   223     69.5 MiB      0.0 MiB               'TIMESYS': self._add_timesys,
   224     69.5 MiB      0.0 MiB               'INFO': self._add_info,
   225     69.5 MiB      0.0 MiB               'DEFINITIONS': self._add_definitions,
   226     69.5 MiB      0.0 MiB               'DESCRIPTION': self._ignore_add,
   227     69.5 MiB      0.0 MiB               'GROUP': self._add_group}
   228                             
   229   4260.0 MiB      0.0 MiB           for start, tag, data, pos in iterator:
   230   4260.0 MiB      0.0 MiB               if start:
   231     69.5 MiB      0.0 MiB                   tag_mapping.get(tag, self._add_unknown_tag)(
   232   4260.0 MiB   4190.4 MiB                       iterator, tag, data, config, pos)
   233   4260.0 MiB      0.0 MiB               elif tag == 'DESCRIPTION':
   234                                             if self.description is not None:
   235                                                 warn_or_raise(W17, W17, 'VOTABLE', config, pos)
   236                                             self.description = data or None
   237                             
   238   4260.0 MiB      0.0 MiB           if not len(self.resources) and config['version_1_2_or_later']:
   239                                         warn_or_raise(W53, W53, (), config, pos)
   240                             
   241   4260.0 MiB      0.0 MiB           return self

Higher-level parse()
Line #    Mem usage    Increment   Line Contents
================================================
   500     69.2 MiB     69.2 MiB   @profile
   501                             def parse(source, columns=None, invalid='exception', verify=None,
   502                                       chunk_size=tree.DEFAULT_CHUNK_SIZE, table_number=None,
   503                                       table_id=None, filename=None, unit_format=None,
   504                                       datatype_mapping=None, _debug_python_based_parser=False):
...
   585     69.2 MiB      0.0 MiB       from astropy.io.votable import conf
   586                             
   587     69.2 MiB      0.0 MiB       invalid = invalid.lower()
   588     69.2 MiB      0.0 MiB       if invalid not in ('exception', 'mask'):
   589                                     raise ValueError("accepted values of ``invalid`` are: "
   590                                                      "``'exception'`` or ``'mask'``.")
   591                             
   592     69.2 MiB      0.0 MiB       if verify is None:
   593                             
   594                                     # NOTE: since the pedantic argument isn't fully deprecated yet, we need
   595                                     # to catch the deprecation warning that occurs when accessing the
   596                                     # configuration item, but only if it is for the pedantic option in the
   597                                     # [io.votable] section.
   598     69.2 MiB      0.0 MiB           with warnings.catch_warnings():
   599     69.2 MiB      0.0 MiB               warnings.filterwarnings(
   600     69.2 MiB      0.0 MiB                   "ignore",
   601     69.2 MiB      0.0 MiB                   r"Config parameter \'pedantic\' in section \[io.votable\]",
   602     69.2 MiB      0.0 MiB                   AstropyDeprecationWarning)
   603     69.2 MiB      0.0 MiB               conf_verify_lowercase = conf.verify.lower()
   604                             
   605                                     # We need to allow verify to be booleans as strings since the
   606                                     # configuration framework doesn't make it easy/possible to have mixed
   607                                     # types.
   608     69.2 MiB      0.0 MiB           if conf_verify_lowercase in ['false', 'true']:
   609                                         verify = conf_verify_lowercase == 'true'
   610                                     else:
   611     69.2 MiB      0.0 MiB               verify = conf_verify_lowercase
   612                             
   613     69.2 MiB      0.0 MiB       if isinstance(verify, bool):
   614                                     verify = 'exception' if verify else 'warn'
   615     69.2 MiB      0.0 MiB       elif verify not in VERIFY_OPTIONS:
   616                                     raise ValueError('verify should be one of {}'.format(
   617                                         '/'.join(VERIFY_OPTIONS)))
   618                             
   619     69.2 MiB      0.0 MiB       if datatype_mapping is None:
   620     69.2 MiB      0.0 MiB           datatype_mapping = {}
   621                             
   622                                 config = {
   623     69.2 MiB      0.0 MiB           'columns': columns,
   624     69.2 MiB      0.0 MiB           'invalid': invalid,
   625     69.2 MiB      0.0 MiB           'verify': verify,
   626     69.2 MiB      0.0 MiB           'chunk_size': chunk_size,
   627     69.2 MiB      0.0 MiB           'table_number': table_number,
   628     69.2 MiB      0.0 MiB           'filename': filename,
   629     69.2 MiB      0.0 MiB           'unit_format': unit_format,
   630     69.2 MiB      0.0 MiB           'datatype_mapping': datatype_mapping
   631                                 }
   632                             
   633     69.2 MiB      0.0 MiB       if filename is None and isinstance(source, str):
   634     69.2 MiB      0.0 MiB           config['filename'] = source
   635                             
   636     69.2 MiB      0.0 MiB       with iterparser.get_xml_iterator(
   637     69.2 MiB      0.0 MiB               source,
   638     69.4 MiB      0.2 MiB               _debug_python_based_parser=_debug_python_based_parser) as iterator:
   639     69.4 MiB      0.0 MiB           return VOTableFile(
   640   4260.0 MiB   4190.6 MiB               config=config, pos=(1, 1)).parse(iterator, config)
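
In both listings above, the only large increment is the ~4190 MiB jump: line 232 in VOTableFile.parse() (the call dispatched through tag_mapping) and line 640 in the higher-level parse(). With reports this long, a small helper can pull out the biggest increments. The sketch below assumes the row layout shown in these listings; the regular expression and the name `largest_increments` are made up for this illustration and are not part of memory-profiler.

```python
import re

# Assumed row shape, based on the listings above:
#   <line no>  <mem usage> MiB  <increment> MiB  <source text>
ROW = re.compile(r"^\s*(\d+)\s+(-?[\d.]+) MiB\s+(-?[\d.]+) MiB\s?(.*)$")


def largest_increments(report, top=3):
    """Return up to `top` (increment MiB, line number, source) tuples."""
    rows = []
    for line in report.splitlines():
        match = ROW.match(line)
        if match:
            lineno, _mem, increment, source = match.groups()
            rows.append((float(increment), int(lineno), source.strip()))
    rows.sort(reverse=True)  # biggest increment first
    return rows[:top]


sample = """\
   229   4260.0 MiB      0.0 MiB           for start, tag, data, pos in iterator:
   230   4260.0 MiB      0.0 MiB               if start:
   231     69.5 MiB      0.0 MiB                   tag_mapping.get(tag, self._add_unknown_tag)(
   232   4260.0 MiB   4190.4 MiB                       iterator, tag, data, config, pos)
"""
print(largest_increments(sample, top=1))
```

Note that the `@profile` row at the top of each listing reports the baseline as its increment (e.g. 69.4 MiB), so it is worth skipping when reading the result.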