Google Crawler can only get around 100 images instead of 1000
Hi, when I use the search URLs generated by the feed() function in GoogleFeeder, I only get around 100 images even though max_num=1000. I find that all the URLs return the same 100 results as the first URL. It seems that the ijn and start params no longer have any effect. I just want to get close to 1000 images per keyword. Does anybody have a solution?
```python
from urllib.parse import urlencode  # Python 3; icrawler may import this differently

def feed(self, keyword, offset, max_num, language=None, filters=None):
    base_url = 'https://www.google.com/search?'
    self.filter = self.get_filter()
    filter_str = self.filter.apply(filters, sep=',')
    # Queue one search URL per 100 results; ijn is the page index and
    # start the result offset, but Google appears to ignore both now.
    for i in range(offset, offset + max_num, 100):
        params = dict(
            q=keyword,
            ijn=int(i / 100),
            start=i,
            tbs=filter_str,
            tbm='isch')
        if language:
            params['lr'] = 'lang_' + language
        url = base_url + urlencode(params)
        self.out_queue.put({'url': url, 'keyword': keyword, 'next_offset': i + 100})
        self.logger.debug('put url to url_queue: {}'.format(url))
```
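For context, this is roughly how the path above gets triggered (a minimal sketch, assuming icrawler's built-in GoogleImageCrawler wraps this feeder; the keyword and storage directory are illustrative):

```python
# Reproduction sketch: asks for 1000 images but, per this issue,
# only the first ~100 are actually downloaded.
from icrawler.builtin import GoogleImageCrawler

crawler = GoogleImageCrawler(storage={'root_dir': 'images/car'})  # path is illustrative
crawler.crawl(keyword='car', max_num=1000)  # feed() queues 10 URLs, 100 results each
```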
Top GitHub Comments
Same issue for me. It seems the paging method is not working correctly and only the first page is processed. For example, to crawl car images, the URL of the first page is:

https://www.google.com/search?q=car&ijn=0&start=0&tbm=isch

This page is fine and the crawler can fetch around 100 images. For the next pages the URLs are:

https://www.google.com/search?q=car&ijn=1&start=100&tbm=isch
https://www.google.com/search?q=car&ijn=2&start=200&tbm=isch
…

Parsing these pages does not return any results. I have also checked them in my browser, and they all return the same results as the first page.
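You can check this behavior yourself with a rough sketch like the one below (uses the requests library; the User-Agent header and the prefix comparison are illustrative assumptions, and Google may serve scripts different markup than browsers):

```python
# Fetch two "pages" of image results and compare them crudely.
import requests

HEADERS = {'User-Agent': 'Mozilla/5.0'}  # Google rejects UA-less requests

def fetch_page(ijn, start):
    url = ('https://www.google.com/search'
           f'?q=car&ijn={ijn}&start={start}&tbm=isch')
    return requests.get(url, headers=HEADERS, timeout=10).text

page0 = fetch_page(0, 0)
page1 = fetch_page(1, 100)
# If pagination worked, the two pages would embed different image results;
# in practice their sizes and content come back essentially identical.
print(len(page0), len(page1))
print(page0[:500] == page1[:500])  # crude check on the leading markup
```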
It seems that Google's algorithm now causes fewer resources to be crawled than expected. I solved this problem by brute force, iterating over disjoint date filter ranges (see the sketch below).

Edit: Note that this method may cause image duplication, so you should postprocess the collected images. FYI, I use the imagededup Python library, a CNN-based duplicate image detector.
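The comment's original snippet was not preserved, so here is a sketch of what the workaround could look like, assuming icrawler's documented filters=dict(date=(start, end)) option and the file_idx_offset parameter; the keyword, date windows, and directory are illustrative:

```python
# Crawl the same keyword over disjoint date windows so each crawl gets
# a fresh first page (~100 results) instead of the same one repeatedly.
from datetime import date, timedelta
from icrawler.builtin import GoogleImageCrawler

crawler = GoogleImageCrawler(storage={'root_dir': 'images/car'})

window_start = date(2020, 1, 1)        # illustrative starting point
for _ in range(10):                    # 10 windows -> up to ~1000 images
    window_end = window_start + timedelta(days=29)
    crawler.crawl(
        keyword='car',
        max_num=100,                   # roughly one result page per window
        filters={'date': ((window_start.year, window_start.month, window_start.day),
                          (window_end.year, window_end.month, window_end.day))},
        file_idx_offset='auto')        # avoid overwriting earlier files
    window_start = window_end + timedelta(days=1)
```

For the postprocessing step, imagededup's documented CNN interface can then flag near-duplicates collected across windows:

```python
# Find duplicates among the crawled files (directory matches the crawl above).
from imagededup.methods import CNN

cnn = CNN()
encodings = cnn.encode_images(image_dir='images/car')
duplicates = cnn.find_duplicates(encoding_map=encodings)
```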