pd.read_json yields: OSError: [Errno 22] Invalid argument
Code Sample, a copy-pastable example if possible
import pandas as pd
# Path to the merged file, which holds one tweet (one JSON object) per line
data = '/Users/davidleifer/Desktop/Geog500/thesis/data/merged-file.json'
df = pd.read_json(data, lines=True)
Problem description
The JSON file contains Twitter data scraped using their API. I’ve limited the files to 10,000 tweets per file. I clean the files using this process:
- Merge files in directory using: cat * > merged-file.json
- Remove blank lines in Sublime Text using Find and Replace: ^\n.
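The two cleaning steps above can also be done in one pass from Python, which avoids the editor round trip. This is only a sketch under assumptions: the function name `merge_json_lines`, the `*.json` pattern, and UTF-8 encoding are illustrative choices, not part of the original workflow.

```python
import glob
import os


def merge_json_lines(input_dir, output_path, pattern="*.json"):
    """Concatenate files matching `pattern` into one file, skipping blank lines.

    Returns the number of lines written. The name, the pattern, and the
    UTF-8 encoding are assumptions made for this sketch.
    """
    written = 0
    out_abs = os.path.abspath(output_path)
    with open(output_path, "w", encoding="utf-8") as out:
        for path in sorted(glob.glob(os.path.join(input_dir, pattern))):
            if os.path.abspath(path) == out_abs:
                continue  # do not feed the merged file back into itself
            with open(path, "r", encoding="utf-8") as src:
                for line in src:
                    if line.strip():  # drop empty and whitespace-only lines
                        out.write(line.rstrip("\r\n") + "\n")
                        written += 1
    return written
```

Writing each kept line with its own newline also guards against an input file that does not end in a newline, which would otherwise glue two tweets onto one line.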
Here is an example Tweet (one tweet per line):
{“created_at”:“Thu Nov 02 08:08:01 +0000 2017”,“id”:925997914136002562,“id_str”:“925997914136002562”,“text”:“#RussianGate #FollowTheFacts #Resist #FakePresident #GOP #War #Vote #ClimateChange #Peace #Animals #Women https://t.co/xe7AEdod1Y”,“display_text_range”:[0,105],“source”:“\u003ca href="http://twitter.com" rel="nofollow"\u003eTwitter Web Client\u003c/a\u003e”,“truncated”:false,“in_reply_to_status_id”:null,“in_reply_to_status_id_str”:null,“in_reply_to_user_id”:null,“in_reply_to_user_id_str”:null,“in_reply_to_screen_name”:null,“user”:{“id”:760436942,“id_str”:“760436942”,“name”:“Athoughtz”,“screen_name”:“athoughtz”,“location”:“United States”,“url”:null,“description”:“#RussianGate #FollowTheFacts #Resist #FakePresident #GOP #War #Vote #ClimateChange #Peace #Animals #Women”,“translator_type”:“none”,“protected”:false,“verified”:false,“followers_count”:5063,“friends_count”:5064,“listed_count”:142,“favourites_count”:659,“statuses_count”:62057,“created_at”:“Thu Aug 16 00:11:12 +0000 2012”,“utc_offset”:-25200,“time_zone”:“Arizona”,“geo_enabled”:false,“lang”:“en”,“contributors_enabled”:false,“is_translator”:false,“profile_background_color”:“C0DEED”,“profile_background_image_url”:“http://abs.twimg.com/images/themes/theme1/bg.png”,“profile_background_image_url_https”:“https://abs.twimg.com/images/themes/theme1/bg.png”,“profile_background_tile”:false,“profile_link_color”:“1DA1F2”,“profile_sidebar_border_color”:“C0DEED”,“profile_sidebar_fill_color”:“DDEEF6”,“profile_text_color”:“333333”,“profile_use_background_image”:true,“profile_image_url”:“http://pbs.twimg.com/profile_images/378800000835488491/565d1bd43c8b0a615b8a39887e52ef2c_normal.jpeg”,“profile_image_url_https”:“https://pbs.twimg.com/profile_images/378800000835488491/565d1bd43c8b0a615b8a39887e52ef2c_normal.jpeg”,“default_profile”:true,“default_profile_image”:false,“following”:null,“follow_request_sent”:null,“notifications”:null},“geo”:null,“coordinates”:null,“place”:null,“contributors”:null,“is_quote_status”:false,“quot
e_count”:0,“reply_count”:0,“retweet_count”:0,“favorite_count”:0,“entities”:{“hashtags”:[{“text”:“RussianGate”,“indices”:[0,12]},{“text”:“FollowTheFacts”,“indices”:[13,28]},{“text”:“Resist”,“indices”:[29,36]},{“text”:“FakePresident”,“indices”:[37,51]},{“text”:“GOP”,“indices”:[52,56]},{“text”:“War”,“indices”:[57,61]},{“text”:“Vote”,“indices”:[62,67]},{“text”:“ClimateChange”,“indices”:[68,82]},{“text”:“Peace”,“indices”:[83,89]},{“text”:“Animals”,“indices”:[90,98]},{“text”:“Women”,“indices”:[99,105]}],“urls”:[],“user_mentions”:[],“symbols”:[],“media”:[{“id”:925997885778378752,“id_str”:“925997885778378752”,“indices”:[106,129],“media_url”:“http://pbs.twimg.com/media/DNnOK8SVQAAUS6Z.jpg”,“media_url_https”:“https://pbs.twimg.com/media/DNnOK8SVQAAUS6Z.jpg”,“url”:“https://t.co/xe7AEdod1Y”,“display_url”:“pic.twitter.com/xe7AEdod1Y”,“expanded_url”:“https://twitter.com/athoughtz/status/925997914136002562/photo/1”,“type”:“photo”,“sizes”:{“medium”:{“w”:600,“h”:585,“resize”:“fit”},“small”:{“w”:600,“h”:585,“resize”:“fit”},“thumb”:{“w”:150,“h”:150,“resize”:“crop”},“large”:{“w”:600,“h”:585,“resize”:“fit”}}}]},“extended_entities”:{“media”:[{“id”:925997885778378752,“id_str”:“925997885778378752”,“indices”:[106,129],“media_url”:“http://pbs.twimg.com/media/DNnOK8SVQAAUS6Z.jpg”,“media_url_https”:“https://pbs.twimg.com/media/DNnOK8SVQAAUS6Z.jpg”,“url”:“https://t.co/xe7AEdod1Y”,“display_url”:“pic.twitter.com/xe7AEdod1Y”,“expanded_url”:“https://twitter.com/athoughtz/status/925997914136002562/photo/1”,“type”:“photo”,“sizes”:{“medium”:{“w”:600,“h”:585,“resize”:“fit”},“small”:{“w”:600,“h”:585,“resize”:“fit”},“thumb”:{“w”:150,“h”:150,“resize”:“crop”},“large”:{“w”:600,“h”:585,“resize”:“fit”}}}]},“favorited”:false,“retweeted”:false,“possibly_sensitive”:false,“filter_level”:“low”,“lang”:“und”,“timestamp_ms”:“1509610081596”} {“created_at”:“Thu Nov 02 08:08:02 +0000 2017”,“id”:925997918795866113,“id_str”:“925997918795866113”,“text”:“RT @CGTNOfficial: Survey released on Chinese public awareness of 
#climatechange https://t.co/q92jAnobmd”,“source”:“\u003ca href="http://nosudo.co" rel="nofollow"\u003eQxNews-python\u003c/a\u003e”,“truncated”:false,“in_reply_to_status_id”:null,“in_reply_to_status_id_str”:null,“in_reply_to_user_id”:null,“in_reply_to_user_id_str”:null,“in_reply_to_screen_name”:null,“user”:{“id”:1664059166,“id_str”:“1664059166”,“name”:“Question News”,“screen_name”:“QxNews”,“location”:“USA”,“url”:null,“description”:“Interrogare Semper | News bot/humans via retweets | 1 min per retweet”,“translator_type”:“none”,“protected”:false,“verified”:false,“followers_count”:3254,“friends_count”:271,“listed_count”:2786,“favourites_count”:38,“statuses_count”:1018592,“created_at”:“Mon Aug 12 03:35:37 +0000 2013”,“utc_offset”:-25200,“time_zone”:“Pacific Time (US & Canada)”,“geo_enabled”:false,“lang”:“en”,“contributors_enabled”:false,“is_translator”:false,“profile_background_color”:“000000”,“profile_background_image_url”:“http://pbs.twimg.com/profile_background_images/514662332492816384/TuhAkn7d.jpeg”,“profile_background_image_url_https”:“https://pbs.twimg.com/profile_background_images/514662332492816384/TuhAkn7d.jpeg”,“profile_background_tile”:false,“profile_link_color”:“000000”,“profile_sidebar_border_color”:“FFFFFF”,“profile_sidebar_fill_color”:“DDEEF6”,“profile_text_color”:“333333”,“profile_use_background_image”:true,“profile_image_url”:“http://pbs.twimg.com/profile_images/597288578092240896/ePlmSYCH_normal.png”,“profile_image_url_https”:“https://pbs.twimg.com/profile_images/597288578092240896/ePlmSYCH_normal.png”,“profile_banner_url”:“https://pbs.twimg.com/profile_banners/1664059166/1484679111”,“default_profile”:false,“default_profile_image”:false,“following”:null,“follow_request_sent”:null,“notifications”:null},“geo”:null,“coordinates”:null,“place”:null,“contributors”:null,“retweeted_status”:{“created_at”:“Thu Nov 02 07:55:00 +0000 2017”,“id”:925994638019825664,“id_str”:“925994638019825664”,“text”:“Survey released on Chinese public awareness of #climatechange 
https://t.co/q92jAnobmd”,“source”:“\u003ca href="https://about.twitter.com/products/tweetdeck" rel="nofollow"\u003eTweetDeck\u003c/a\u003e”,“truncated”:false,“in_reply_to_status_id”:null,“in_reply_to_status_id_str”:null,“in_reply_to_user_id”:null,“in_reply_to_user_id_str”:null,“in_reply_to_screen_name”:null,“user”:{“id”:1115874631,“id_str”:“1115874631”,“name”:“CGTN”,“screen_name”:“CGTNOfficial”,“location”:“Beijing, China”,“url”:“http://www.CGTN.com”,“description”:“China Global Television Network, or CGTN, is a multi-language, multi-platform media grouping.”,“translator_type”:“none”,“protected”:false,“verified”:true,“followers_count”:4828619,“friends_count”:53,“listed_count”:4517,“favourites_count”:32,“statuses_count”:39079,“created_at”:“Thu Jan 24 03:18:59 +0000 2013”,“utc_offset”:28800,“time_zone”:“Beijing”,“geo_enabled”:true,“lang”:“en”,“contributors_enabled”:false,“is_translator”:false,“profile_background_color”:“131516”,“profile_background_image_url”:“http://pbs.twimg.com/profile_background_images/378800000169084583/SqpyvnvQ.jpeg”,“profile_background_image_url_https”:“https://pbs.twimg.com/profile_background_images/378800000169084583/SqpyvnvQ.jpeg”,“profile_background_tile”:true,“profile_link_color”:“009999”,“profile_sidebar_border_color”:“FFFFFF”,“profile_sidebar_fill_color”:“EFEFEF”,“profile_text_color”:“333333”,“profile_use_background_image”:true,“profile_image_url”:“http://pbs.twimg.com/profile_images/815049165508112384/wJA8jWZh_normal.jpg”,“profile_image_url_https”:“https://pbs.twimg.com/profile_images/815049165508112384/wJA8jWZh_normal.jpg”,“profile_banner_url”:“https://pbs.twimg.com/profile_banners/1115874631/1483157766”,“default_profile”:false,“default_profile_image”:false,“following”:null,“follow_request_sent”:null,“notifications”:null},“geo”:null,“coordinates”:null,“place”:null,“contributors”:null,“is_quote_status”:false,“quote_count”:0,“reply_count”:0,“retweet_count”:10,“favorite_count”:25,“entities”:{“hashtags”:[{“text”:“climatechange”,“indices”:[47,
61]}],“urls”:[{“url”:“https://t.co/q92jAnobmd”,“expanded_url”:“https://news.cgtn.com/news/794d7a4e33597a6333566d54/share_p.html”,“display_url”:“news.cgtn.com/news/794d7a4e3\u2026”,“indices”:[62,85]}],“user_mentions”:[],“symbols”:[]},“favorited”:false,“retweeted”:false,“possibly_sensitive”:false,“filter_level”:“low”,“lang”:“en”},“is_quote_status”:false,“quote_count”:0,“reply_count”:0,“retweet_count”:0,“favorite_count”:0,“entities”:{“hashtags”:[{“text”:“climatechange”,“indices”:[65,79]}],“urls”:[{“url”:“https://t.co/q92jAnobmd”,“expanded_url”:“https://news.cgtn.com/news/794d7a4e33597a6333566d54/share_p.html”,“display_url”:“news.cgtn.com/news/794d7a4e3\u2026”,“indices”:[80,103]}],“user_mentions”:[{“screen_name”:“CGTNOfficial”,“name”:“CGTN”,“id”:1115874631,“id_str”:“1115874631”,“indices”:[3,16]}],“symbols”:[]},“favorited”:false,“retweeted”:false,“possibly_sensitive”:false,“filter_level”:“low”,“lang”:“en”,“timestamp_ms”:“1509610082707”}
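Before loading, it may help to confirm that the merged file really holds one valid JSON object per line. The sketch below reports the first few lines that fail to parse; `find_bad_lines` and its `limit` parameter are hypothetical names, and UTF-8 encoding is assumed. Typographic quotes or two records sharing a line, for example, would show up here as parse errors.

```python
import json


def find_bad_lines(path, limit=10):
    """Return up to `limit` (line_number, message) pairs for lines that are
    blank or are not valid JSON. Hypothetical helper for this sketch."""
    problems = []
    with open(path, "r", encoding="utf-8") as fh:
        for line_number, line in enumerate(fh, start=1):
            if not line.strip():
                problems.append((line_number, "blank line"))
            else:
                try:
                    json.loads(line)
                except json.JSONDecodeError as exc:
                    problems.append((line_number, exc.msg))
            if len(problems) >= limit:
                break  # stop early on badly damaged files
    return problems
```

An empty list means every line parsed on its own, so any remaining failure is more likely related to how the file is read than to its content.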
I get this error:
OSError                                   Traceback (most recent call last)
<ipython-input-4-5322def5edd5> in <module>()
----> 1 df = pd.read_json(data, lines=True)

/Users/davidleifer/anaconda/lib/python3.5/site-packages/pandas/io/json.py in read_json(path_or_buf, orient, typ, dtype, convert_axes, convert_dates, keep_default_dates, numpy, precise_float, date_unit, encoding, lines)
    214     if exists:
    215         with _get_handle(filepath_or_buffer, 'r', encoding=encoding) as fh:
--> 216             json = fh.read()
    217     else:
    218         json = filepath_or_buffer

OSError: [Errno 22] Invalid argument
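The traceback points at a single `fh.read()` call on the whole file. A commonly reported cause of `[Errno 22]` on macOS with older Python versions is one read larger than roughly 2 GB; whether that applies here is an assumption, since the size of the merged file is not stated. Under that assumption, a possible workaround is to read the file in bounded pieces. `read_text_in_chunks` and `chunk_chars` are hypothetical names used only for this sketch.

```python
def read_text_in_chunks(path, chunk_chars=64 * 1024 * 1024, encoding="utf-8"):
    """Read a whole text file with many bounded reads instead of one large read().

    Hypothetical helper: it still holds the full text in memory, it only
    avoids asking the operating system for everything in a single call.
    """
    pieces = []
    with open(path, "r", encoding=encoding) as fh:
        while True:
            chunk = fh.read(chunk_chars)  # at most chunk_chars characters
            if not chunk:
                break  # an empty string signals end of file
            pieces.append(chunk)
    return "".join(pieces)
```

The result could then be passed to pandas through a file-like wrapper, for example `pd.read_json(io.StringIO(read_text_in_chunks(data)), lines=True)`. This has not been verified against the file from this report.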
Expected Output
Loading the JSON into a pandas dataframe.
Output of pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Darwin
OS-release: 16.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.19.0
nose: 1.3.7
pip: 9.0.1
setuptools: 36.2.7
Cython: 0.24
numpy: 1.13.2
scipy: 0.19.1
statsmodels: 0.6.1
xarray: None
IPython: 4.2.0
sphinx: 1.4.1
patsy: 0.4.1
dateutil: 2.5.3
pytz: 2016.4
blosc: None
bottleneck: 1.1.0
tables: 3.3.0
numexpr: 2.6.2
matplotlib: 1.5.1
openpyxl: 2.3.2
xlrd: 1.0.0
xlwt: 1.1.2
xlsxwriter: 0.9.2
lxml: 3.6.0
bs4: None
html5lib: 0.999999999
httplib2: 0.9.2
apiclient: 1.5.1
sqlalchemy: 1.0.13
pymysql: None
psycopg2: 2.6.2 (dt dec pq3 ext lo64)
jinja2: 2.8
boto: 2.48.0
pandas_datareader: None
Issue Analytics
- Created 6 years ago
- Reactions: 2
- Comments: 26 (7 by maintainers)
Top GitHub Comments
Same bug with pd.to_json from a CSV file. The CSV file is only 700 MB. I can in fact convert it to JSON the long way, but that gives a slightly different format than I would like. Pandas version is 0.23.4.
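For the CSV case in the comment above, one way to sidestep a single large write is to stream the rows out one at a time. This is a sketch using only the standard library, not the commenter's actual "long way"; `csv_to_json_lines` is a hypothetical name, and every value is written as a string because `csv.DictReader` does no type conversion.

```python
import csv
import json


def csv_to_json_lines(csv_path, json_path, encoding="utf-8"):
    """Stream a CSV file to JSON Lines, one object per row; returns the row count.

    Hypothetical helper: values stay strings, unlike a pandas round trip.
    """
    rows = 0
    with open(csv_path, "r", newline="", encoding=encoding) as src, \
            open(json_path, "w", encoding=encoding) as dst:
        for row in csv.DictReader(src):  # header row supplies the keys
            dst.write(json.dumps(row) + "\n")
            rows += 1
    return rows
```

Because only one row is held in memory and written at a time, the size of the input file does not affect the size of any individual write.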
Hit the same bug with a proper JSON Lines file of 13 GB on macOS and pandas 0.23.0. Please reopen the issue.