URLs without valid host name (and routing) stay DISCOVERED forever
From time to time I have to clean up URLs which fail to fetch and stay DISCOVERED forever. These URLs (e.g. http:/feeds/xml/latest.xml) are valid in terms of both java.net.URL and java.net.URI but lack a valid host name (empty for URL, null for URI).
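A minimal sketch of the parsing behaviour described above (the class and helper names are hypothetical, chosen just for illustration): both java.net.URL and java.net.URI accept the single-slash URL without complaint, but report an empty respectively null host.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;

public class HostCheckDemo {

    // Host name as java.net.URL reports it ("" when the authority is missing).
    static String urlHost(String s) throws MalformedURLException {
        return new URL(s).getHost();
    }

    // Host name as java.net.URI reports it (null when the authority is missing).
    static String uriHost(String s) throws URISyntaxException {
        return new URI(s).getHost();
    }

    public static void main(String[] args) throws Exception {
        String broken = "http:/feeds/xml/latest.xml";
        // Neither constructor throws for the single-slash URL.
        System.out.println("URL host: '" + urlHost(broken) + "'"); // empty string
        System.out.println("URI host: " + uriHost(broken));        // null
    }
}
```

This is why such URLs survive parsing and end up in the status index despite being unfetchable.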
My first thought was that #704 also addresses this issue, but that isn't the case. The main difference: the logs show no evidence that any attempt is made to fetch these URLs without a host name. The only log messages come from the
- feed parser (when checking robots.txt before #700):
2019-04-09 06:19:37.927 c.d.s.p.RobotRulesParser Thread-24-feed-executor[4 7] [INFO] Couldn't get robots.txt for http:/feeds/xml/latest.xml : java.net.UnknownHostException: robots.txt: Name or service not known
2019-04-09 06:19:37.928 c.d.s.b.FeedParserBolt Thread-24-feed-executor[4 7] [INFO] Feed parser done http://sportdog.gr/feeds/xml/latest.xml
- and StatusUpdaterBolt with debug logging:
2019-04-12 09:25:46.379 c.d.s.e.p.StatusUpdaterBolt Thread-16-status-executor[24 24] [DEBUG] Added to waitAck http:/feeds/xml/latest.xml with ID d9b44c50cbf08dc553acaab40fb8d9e58614e655c0738d19f2a481104d405ca2 total 1
2019-04-12 09:25:46.379 c.d.s.e.p.StatusUpdaterBolt Thread-16-status-executor[24 24] [DEBUG] Sending to ES buffer http:/feeds/xml/latest.xml with ID d9b44c50cbf08dc553acaab40fb8d9e58614e655c0738d19f2a481104d405ca2
2019-04-12 09:25:46.379 c.d.s.p.AdaptiveScheduler Thread-16-status-executor[24 24] [DEBUG] Scheduling status: DISCOVERED, metadata: discoveryDate: 2019-04-12T09:25:46.379Z
The invalid URL stems from an Atom feed:
<entry>
<title>Η Κωνσταντίνα Σπυροπούλου μας δείχνει τα κάλλη της με σέξι μπικίνι - Πρέπει να δεις αυτή τη φωτό!</title>
<link rel="alternate" type="text/html" href="//#rurl_blhttp://newpost.gr/lifestyle/5caf3cec90e42f7a56d7db7a/i-konstantina-spyropoyloy-mas-deihnei-ta-kalli-tis-me-sexi-mpikini"/>
<published>2019-04-11T17:09:00+00:00</published>
The mentioned URL is stored in the status index without a routing key (metadata.hostname):
"hits" : [
{
"_index" : "status",
"_type" : "status",
"_id" : "d9b44c50cbf08dc553acaab40fb8d9e58614e655c0738d19f2a481104d405ca2",
"_score" : 0.2876821,
"_source" : {
"url" : "http:/feeds/xml/latest.xml",
"status" : "DISCOVERED",
"metadata" : {
"url%2Epath" : [
"http://sportdog.gr/feeds/xml/latest.xml"
],
"depth" : [
"1"
]
},
"nextFetchDate" : "2019-04-12T09:25:46.000Z"
}
}
]
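Assuming the status index layout shown above (where the hostname is simply absent from the document's metadata), stuck entries like this one could be located with a query along these lines. This is a sketch, not a verified StormCrawler recipe; the field names are taken from the document above.

```json
{
  "query": {
    "bool": {
      "must": [
        { "term": { "status": "DISCOVERED" } }
      ],
      "must_not": [
        { "exists": { "field": "metadata.hostname" } }
      ]
    }
  }
}
```

Running this against the status index would surface all DISCOVERED documents that were indexed without a hostname, which is what the manual clean-up currently targets.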
One solution would be to filter these URLs away: reject a URL if its host name is empty and its protocol is http/https (file:// URLs are legitimately allowed without a host).
Alternatively, the empty host or domain name could be allowed as a routing key, so that these items fail explicitly instead of lingering. Also (I haven't checked it): isn't the empty routing key mandatory in a crawl which mixes http:// and file:// URLs?
Issue Analytics
- Created: 4 years ago
- Comments: 6 (4 by maintainers)
Top GitHub Comments
The cross-domain filter has been disabled, and later HostURLFilter was removed entirely from the config. There can be cross-domain links from feeds, especially for sites "hosting" their feed on feedburner.com.

Makes sense. We could add a check for a valid hostname in the basic URL filter or normaliser.
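The check suggested in the comment above could look roughly like the following. This is a standalone sketch, not StormCrawler's actual URLFilter API; the class and method names are hypothetical.

```java
import java.net.URL;

public class HostNameCheck {

    // Hypothetical predicate: reject http/https URLs whose host name is empty,
    // while still allowing file:// URLs, which legitimately have no host.
    static boolean hasRequiredHost(URL url) {
        String proto = url.getProtocol();
        if ("http".equals(proto) || "https".equals(proto)) {
            String host = url.getHost();
            return host != null && !host.isEmpty();
        }
        return true; // file:// and other schemes may omit the host
    }

    public static void main(String[] args) throws Exception {
        System.out.println(hasRequiredHost(new URL("http:/feeds/xml/latest.xml")));              // false
        System.out.println(hasRequiredHost(new URL("http://sportdog.gr/feeds/xml/latest.xml"))); // true
        System.out.println(hasRequiredHost(new URL("file:///tmp/seeds.txt")));                   // true
    }
}
```

Wired into the basic URL filter or normaliser, such a predicate would drop host-less http/https URLs at discovery time, before they ever reach the status index.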