
Problems with "?" in robots.txt


In https://www.welt.de/robots.txt there are entries containing "?", such as Disallow: /*?config. Hence https://www.welt.de/test?config should be disallowed, but it is reported as allowed. Entries like Disallow: /*.xmli, by contrast, work properly and disallow https://www.welt.de/test.xmli. After investigating, I found that "?" is the problematic character.

I use RobotstxtServer#allow("https://www.welt.de/test?config") for testing.
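
For illustration, here is a minimal, self-contained sketch of how a robots.txt path pattern can be compiled so that "*" is the only wildcard and "?" is matched as a literal character, which is what the welt.de rules rely on. The class and method names below are illustrative only, not crawler4j's actual internals:

```java
import java.util.regex.Pattern;

public class RobotsPatternDemo {

    // Compile a robots.txt path pattern into a regex in which '*' matches
    // any character sequence and every other character -- including '?' --
    // is treated as a literal.
    static Pattern compile(String robotsPattern) {
        StringBuilder regex = new StringBuilder();
        boolean first = true;
        for (String literal : robotsPattern.split("\\*", -1)) {
            if (!first) {
                regex.append(".*");               // '*' -> match anything
            }
            regex.append(Pattern.quote(literal)); // keeps '?', '.', etc. literal
            first = false;
        }
        return Pattern.compile(regex.toString());
    }

    // Disallow rules match as a path prefix, so anchor only at the start.
    static boolean disallowed(String pattern, String path) {
        return compile(pattern).matcher(path).lookingAt();
    }

    public static void main(String[] args) {
        System.out.println(disallowed("/*?config", "/test?config")); // true
        System.out.println(disallowed("/*.xmli", "/test.xmli"));     // true
        System.out.println(disallowed("/*?config", "/test.html"));   // false
    }
}
```

Under this interpretation, Disallow: /*?config blocks /test?config in exactly the same way that Disallow: /*.xmli blocks /test.xmli; a parser only misbehaves if it gives "?" some special meaning.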

Issue Analytics

  • State: open
  • Created: 6 years ago
  • Comments: 15 (2 by maintainers)

Top GitHub Comments

1 reaction
sebastian-nagel commented, Mar 22, 2018

I’ll keep this on the radar, and will add a unit test to crawler-commons’ robots.txt parser, just to make sure that it continues to work. Thanks!
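
A unit test along those lines might look roughly like the following sketch against crawler-commons' SimpleRobotRulesParser. Note that the parseContent signature shown here matches older crawler-commons releases (a single robot-names String) and may differ in newer versions:

```java
import static org.junit.Assert.assertFalse;
import static org.junit.Assert.assertTrue;

import java.nio.charset.StandardCharsets;

import org.junit.Test;

import crawlercommons.robots.BaseRobotRules;
import crawlercommons.robots.SimpleRobotRulesParser;

public class QueryCharacterRulesTest {

    @Test
    public void disallowRuleContainingQuestionMarkIsHonored() {
        String robotsTxt = "User-agent: *\n"
                + "Disallow: /*?config\n"
                + "Disallow: /*.xmli\n";

        BaseRobotRules rules = new SimpleRobotRulesParser().parseContent(
                "https://www.welt.de/robots.txt",
                robotsTxt.getBytes(StandardCharsets.UTF_8),
                "text/plain",
                "test-agent");

        // '?' must be matched literally, not given any special meaning.
        assertFalse(rules.isAllowed("https://www.welt.de/test?config"));
        assertFalse(rules.isAllowed("https://www.welt.de/test.xmli"));
        assertTrue(rules.isAllowed("https://www.welt.de/test.html"));
    }
}
```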

0 reactions
Chaiavi commented, Mar 22, 2018

Thank you.


Read more comments on GitHub >

Top Results From Across the Web

6 Common Robots.txt Issues & How To Fix Them
1. Robots.txt Not In The Root Directory · 2. Poor Use Of Wildcards · 3. Noindex In Robots.txt · 4. Blocked Scripts And...
Read more >
14 Common Robots.txt Issues (and How to Avoid Them)
Robots.txt files inform search engine crawlers which pages or files the crawler can or can't request from your site. They also block user...
Read more >
Robot.txt SEO: Best Practices, Common Problems & Solutions
A broken or missing robots.txt file can also cause search engine crawlers to miss important pages on your website. If you have a...
Read more >
robots.txt is not valid - Chrome Developers
robots.txt is not valid · It can keep search engines from crawling public pages, causing your content to show up less often in...
Read more >
How to Fix the Problems with Drupal's Default Robots.txt File
Go to http://www.yourDrupalsite.com/robots.txt and double-check that your changes are in effect. You may need to do a refresh on your browser to ...
Read more >
