URLs without valid host name (and routing) stay DISCOVERED forever
From time to time I have to clean up URLs which fail to fetch and stay DISCOVERED forever. These URLs (e.g. http:/feeds/xml/latest.xml) are valid in terms of both java.net.URL and java.net.URI but lack a valid host name (empty for URL, null for URI).
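A minimal sketch of the parsing behaviour described above (the class and helper names are hypothetical, chosen just for illustration): both java.net.URL and java.net.URI accept the single-slash URL without complaint, but report an empty respectively null host.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;

public class HostCheckDemo {

    // Host name as java.net.URL reports it ("" when the authority is missing).
    static String urlHost(String s) throws MalformedURLException {
        return new URL(s).getHost();
    }

    // Host name as java.net.URI reports it (null when the authority is missing).
    static String uriHost(String s) throws URISyntaxException {
        return new URI(s).getHost();
    }

    public static void main(String[] args) throws Exception {
        String broken = "http:/feeds/xml/latest.xml";
        // Neither constructor throws for the single-slash URL.
        System.out.println("URL host: '" + urlHost(broken) + "'"); // empty string
        System.out.println("URI host: " + uriHost(broken));        // null
    }
}
```

This is why such URLs survive parsing and end up in the status index despite being unfetchable.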
My first thought was that #704 also addresses this issue, but that isn't the case. The main difference: the logs show no evidence that any attempt is made to fetch these URLs without a host name. The only log messages come from the
- feed parser (when checking robots.txt before #700):
2019-04-09 06:19:37.927 c.d.s.p.RobotRulesParser Thread-24-feed-executor[4 7] [INFO] Couldn't get robots.txt for http:/feeds/xml/latest.xml : java.net.UnknownHostException: robots.txt: Name or service not known
2019-04-09 06:19:37.928 c.d.s.b.FeedParserBolt Thread-24-feed-executor[4 7] [INFO] Feed parser done http://sportdog.gr/feeds/xml/latest.xml
- and StatusUpdaterBolt with debug logging:
2019-04-12 09:25:46.379 c.d.s.e.p.StatusUpdaterBolt Thread-16-status-executor[24 24] [DEBUG] Added to waitAck http:/feeds/xml/latest.xml with ID d9b44c50cbf08dc553acaab40fb8d9e58614e655c0738d19f2a481104d405ca2 total 1
2019-04-12 09:25:46.379 c.d.s.e.p.StatusUpdaterBolt Thread-16-status-executor[24 24] [DEBUG] Sending to ES buffer http:/feeds/xml/latest.xml with ID d9b44c50cbf08dc553acaab40fb8d9e58614e655c0738d19f2a481104d405ca2
2019-04-12 09:25:46.379 c.d.s.p.AdaptiveScheduler Thread-16-status-executor[24 24] [DEBUG] Scheduling status: DISCOVERED, metadata: discoveryDate: 2019-04-12T09:25:46.379Z
The invalid URL stems from an Atom feed:
<entry>
<title>Η Κωνσταντίνα Σπυροπούλου μας δείχνει τα κάλλη της με σέξι μπικίνι - Πρέπει να δεις αυτή τη φωτό!</title>
<link rel="alternate" type="text/html" href="//#rurl_blhttp://newpost.gr/lifestyle/5caf3cec90e42f7a56d7db7a/i-konstantina-spyropoyloy-mas-deihnei-ta-kalli-tis-me-sexi-mpikini"/>
<published>2019-04-11T17:09:00+00:00</published>
The mentioned URL is stored in the status index without a routing key (metadata.hostname):
"hits" : [
{
"_index" : "status",
"_type" : "status",
"_id" : "d9b44c50cbf08dc553acaab40fb8d9e58614e655c0738d19f2a481104d405ca2",
"_score" : 0.2876821,
"_source" : {
"url" : "http:/feeds/xml/latest.xml",
"status" : "DISCOVERED",
"metadata" : {
"url%2Epath" : [
"http://sportdog.gr/feeds/xml/latest.xml"
],
"depth" : [
"1"
]
},
"nextFetchDate" : "2019-04-12T09:25:46.000Z"
}
}
]
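Assuming the status index layout shown above (where the hostname is simply absent from the document's metadata), stuck entries like this one could be located with a query along these lines. This is a sketch, not a verified StormCrawler recipe; the field names are taken from the document above.

```json
{
  "query": {
    "bool": {
      "must": [
        { "term": { "status": "DISCOVERED" } }
      ],
      "must_not": [
        { "exists": { "field": "metadata.hostname" } }
      ]
    }
  }
}
```

Running this against the status index would surface all DISCOVERED documents that were indexed without a hostname, which is what the manual clean-up currently targets.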
One solution would be to filter these URLs away: reject a URL if its host name is empty and its protocol is http/https (file:// URLs are legitimately allowed without a host).
Alternatively, the empty host or domain name could be allowed as a routing key, so that these items fail explicitly instead of lingering. Also (I haven't checked it): isn't the empty routing key mandatory in a crawl which mixes http:// and file:// URLs?
Issue Analytics
- Created: 4 years ago
- Comments: 6 (4 by maintainers)
Top GitHub Comments
The cross-domain filter has been disabled, and later HostURLFilter was removed entirely from the config. There can be cross-domain links from feeds, especially for sites "hosting" their feed on feedburner.com.

Makes sense. We could add a check for a valid hostname in the basic URL filter or normaliser.
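The check suggested in the comment above could look roughly like the following. This is a standalone sketch, not StormCrawler's actual URLFilter API; the class and method names are hypothetical.

```java
import java.net.URL;

public class HostNameCheck {

    // Hypothetical predicate: reject http/https URLs whose host name is empty,
    // while still allowing file:// URLs, which legitimately have no host.
    static boolean hasRequiredHost(URL url) {
        String proto = url.getProtocol();
        if ("http".equals(proto) || "https".equals(proto)) {
            String host = url.getHost();
            return host != null && !host.isEmpty();
        }
        return true; // file:// and other schemes may omit the host
    }

    public static void main(String[] args) throws Exception {
        System.out.println(hasRequiredHost(new URL("http:/feeds/xml/latest.xml")));              // false
        System.out.println(hasRequiredHost(new URL("http://sportdog.gr/feeds/xml/latest.xml"))); // true
        System.out.println(hasRequiredHost(new URL("file:///tmp/seeds.txt")));                   // true
    }
}
```

Wired into the basic URL filter or normaliser, such a predicate would drop host-less http/https URLs at discovery time, before they ever reach the status index.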