Google Crawler can only get around 100 images instead of 1000
Hi, when I use the search URLs generated by the feed() function in GoogleFeeder, I only get around 100 images even though max_num=1000. I find that all the URLs return the same 100 results as the first URL. It seems that the ijn and start params no longer have any effect. I just want to get close to 1000 images per keyword. Does anybody have a solution?
```python
from urllib.parse import urlencode  # Python 3; icrawler may import this differently

def feed(self, keyword, offset, max_num, language=None, filters=None):
    base_url = 'https://www.google.com/search?'
    self.filter = self.get_filter()
    filter_str = self.filter.apply(filters, sep=',')
    # Queue one search URL per 100 results; ijn is the page index and
    # start the result offset, but Google appears to ignore both now.
    for i in range(offset, offset + max_num, 100):
        params = dict(
            q=keyword,
            ijn=int(i / 100),
            start=i,
            tbs=filter_str,
            tbm='isch')
        if language:
            params['lr'] = 'lang_' + language
        url = base_url + urlencode(params)
        self.out_queue.put({'url': url, 'keyword': keyword, 'next_offset': i + 100})
        self.logger.debug('put url to url_queue: {}'.format(url))
```
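For context, this is roughly how the path above gets triggered (a minimal sketch, assuming icrawler's built-in GoogleImageCrawler wraps this feeder; the keyword and storage directory are illustrative):

```python
# Reproduction sketch: asks for 1000 images but, per this issue,
# only the first ~100 are actually downloaded.
from icrawler.builtin import GoogleImageCrawler

crawler = GoogleImageCrawler(storage={'root_dir': 'images/car'})  # path is illustrative
crawler.crawl(keyword='car', max_num=1000)  # feed() queues 10 URLs, 100 results each
```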
Top GitHub Comments
Same issue for me. It seems the paging method is not working correctly and only the first page is processed. For example, to crawl car images, the URL of the first page is:

https://www.google.com/search?q=car&ijn=0&start=0&tbm=isch

This page is fine and the crawler can fetch around 100 images. For the next pages the URLs are:

https://www.google.com/search?q=car&ijn=1&start=100&tbm=isch
https://www.google.com/search?q=car&ijn=2&start=200&tbm=isch
…

Parsing these pages does not return any results. I have also checked them in my browser, and they all return the same results as the first page.
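You can check this behavior yourself with a rough sketch like the one below (uses the requests library; the User-Agent header and the prefix comparison are illustrative assumptions, and Google may serve scripts different markup than browsers):

```python
# Fetch two "pages" of image results and compare them crudely.
import requests

HEADERS = {'User-Agent': 'Mozilla/5.0'}  # Google rejects UA-less requests

def fetch_page(ijn, start):
    url = ('https://www.google.com/search'
           f'?q=car&ijn={ijn}&start={start}&tbm=isch')
    return requests.get(url, headers=HEADERS, timeout=10).text

page0 = fetch_page(0, 0)
page1 = fetch_page(1, 100)
# If pagination worked, the two pages would embed different image results;
# in practice their sizes and content come back essentially identical.
print(len(page0), len(page1))
print(page0[:500] == page1[:500])  # crude check on the leading markup
```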
It seems that Google's algorithm now causes fewer resources to be crawled than expected. I solved this problem by brute force, iterating over disjoint date filter ranges (see the sketch below).

Edit: Note that this method may cause image duplication, so you should postprocess the collected images. FYI, I use the imagededup Python library, a CNN-based duplicate image detector.
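The comment's original snippet was not preserved, so here is a sketch of what the workaround could look like, assuming icrawler's documented filters=dict(date=(start, end)) option and the file_idx_offset parameter; the keyword, date windows, and directory are illustrative:

```python
# Crawl the same keyword over disjoint date windows so each crawl gets
# a fresh first page (~100 results) instead of the same one repeatedly.
from datetime import date, timedelta
from icrawler.builtin import GoogleImageCrawler

crawler = GoogleImageCrawler(storage={'root_dir': 'images/car'})

window_start = date(2020, 1, 1)        # illustrative starting point
for _ in range(10):                    # 10 windows -> up to ~1000 images
    window_end = window_start + timedelta(days=29)
    crawler.crawl(
        keyword='car',
        max_num=100,                   # roughly one result page per window
        filters={'date': ((window_start.year, window_start.month, window_start.day),
                          (window_end.year, window_end.month, window_end.day))},
        file_idx_offset='auto')        # avoid overwriting earlier files
    window_start = window_end + timedelta(days=1)
```

For the postprocessing step, imagededup's documented CNN interface can then flag near-duplicates collected across windows:

```python
# Find duplicates among the crawled files (directory matches the crawl above).
from imagededup.methods import CNN

cnn = CNN()
encodings = cnn.encode_images(image_dir='images/car')
duplicates = cnn.find_duplicates(encoding_map=encodings)
```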