
Problems with "?" in robots.txt


In https://www.welt.de/robots.txt there are entries containing "?", such as Disallow: /*?config. Hence https://www.welt.de/test?config should be disallowed, but it is reported as allowed. Entries like Disallow: /*.xmli, by contrast, work properly and disallow https://www.welt.de/test.xmli. After investigating, I found that "?" is the problematic character.

I use RobotstxtServer#allow("https://www.welt.de/test?config") for testing.
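
For illustration, here is a minimal, self-contained sketch of how a robots.txt path pattern can be compiled so that "*" is the only wildcard and "?" is matched as a literal character, which is what the welt.de rules rely on. The class and method names below are illustrative only, not crawler4j's actual internals:

```java
import java.util.regex.Pattern;

public class RobotsPatternDemo {

    // Compile a robots.txt path pattern into a regex in which '*' matches
    // any character sequence and every other character -- including '?' --
    // is treated as a literal.
    static Pattern compile(String robotsPattern) {
        StringBuilder regex = new StringBuilder();
        boolean first = true;
        for (String literal : robotsPattern.split("\\*", -1)) {
            if (!first) {
                regex.append(".*");               // '*' -> match anything
            }
            regex.append(Pattern.quote(literal)); // keeps '?', '.', etc. literal
            first = false;
        }
        return Pattern.compile(regex.toString());
    }

    // Disallow rules match as a path prefix, so anchor only at the start.
    static boolean disallowed(String pattern, String path) {
        return compile(pattern).matcher(path).lookingAt();
    }

    public static void main(String[] args) {
        System.out.println(disallowed("/*?config", "/test?config")); // true
        System.out.println(disallowed("/*.xmli", "/test.xmli"));     // true
        System.out.println(disallowed("/*?config", "/test.html"));   // false
    }
}
```

Under this interpretation, Disallow: /*?config blocks /test?config in exactly the same way that Disallow: /*.xmli blocks /test.xmli; a parser only misbehaves if it gives "?" some special meaning.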

Issue Analytics

  • State: open
  • Created: 6 years ago
  • Comments: 15 (2 by maintainers)

Top GitHub Comments

1 reaction
sebastian-nagel commented, Mar 22, 2018

I’ll keep this on the radar, and will add a unit test to crawler-commons’ robots.txt parser, just to make sure that it continues to work. Thanks!
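
A unit test along those lines might look roughly like the following sketch against crawler-commons' SimpleRobotRulesParser. Note that the parseContent signature shown here matches older crawler-commons releases (a single robot-names String) and may differ in newer versions:

```java
import static org.junit.Assert.assertFalse;
import static org.junit.Assert.assertTrue;

import java.nio.charset.StandardCharsets;

import org.junit.Test;

import crawlercommons.robots.BaseRobotRules;
import crawlercommons.robots.SimpleRobotRulesParser;

public class QueryCharacterRulesTest {

    @Test
    public void disallowRuleContainingQuestionMarkIsHonored() {
        String robotsTxt = "User-agent: *\n"
                + "Disallow: /*?config\n"
                + "Disallow: /*.xmli\n";

        BaseRobotRules rules = new SimpleRobotRulesParser().parseContent(
                "https://www.welt.de/robots.txt",
                robotsTxt.getBytes(StandardCharsets.UTF_8),
                "text/plain",
                "test-agent");

        // '?' must be matched literally, not given any special meaning.
        assertFalse(rules.isAllowed("https://www.welt.de/test?config"));
        assertFalse(rules.isAllowed("https://www.welt.de/test.xmli"));
        assertTrue(rules.isAllowed("https://www.welt.de/test.html"));
    }
}
```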

0 reactions
Chaiavi commented, Mar 22, 2018

Thank you.


Read more comments on GitHub >

Top Results From Across the Web

6 Common Robots.txt Issues & How To Fix Them
1. Robots.txt Not In The Root Directory · 2. Poor Use Of Wildcards · 3. Noindex In Robots.txt · 4. Blocked Scripts And...
Read more >
14 Common Robots.txt Issues (and How to Avoid Them)
Robots.txt files inform search engine crawlers which pages or files the crawler can or can't request from your site. They also block user...
Read more >
Robot.txt SEO: Best Practices, Common Problems & Solutions
A broken or missing robots.txt file can also cause search engine crawlers to miss important pages on your website. If you have a...
Read more >
robots.txt is not valid - Chrome Developers
robots.txt is not valid · It can keep search engines from crawling public pages, causing your content to show up less often in...
Read more >
How to Fix the Problems with Drupal's Default Robots.txt File
Go to http://www.yourDrupalsite.com/robots.txt and double-check that your changes are in effect. You may need to do a refresh on your browser to ...
Read more >
