URLs without valid host name (and routing) stay DISCOVERED forever

From time to time I have to clean up URLs which fail to fetch and stay DISCOVERED forever. These URLs (e.g., http:/feeds/xml/latest.xml) are valid in terms of both java.net.URL and java.net.URI but lack a valid host name (empty for URL, null for URI).
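
The host values mentioned above can be checked with a few lines of Java. This is only a minimal sketch: the class `EmptyHostDemo` and its two helper methods are hypothetical names made up for illustration and are not part of the crawler's code.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;

// Hypothetical demo class, not taken from the crawler's code base.
public class EmptyHostDemo {

    /** Host as reported by java.net.URL. */
    @SuppressWarnings("deprecation") // new URL(String) is deprecated in recent JDKs
    static String hostFromUrl(String s) {
        try {
            return new URL(s).getHost();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException(e);
        }
    }

    /** Host as reported by java.net.URI. */
    static String hostFromUri(String s) {
        try {
            return new URI(s).getHost();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        String broken = "http:/feeds/xml/latest.xml";
        // Both parsers accept the string without throwing.
        System.out.println("URL host: '" + hostFromUrl(broken) + "'");
        System.out.println("URI host: " + hostFromUri(broken));
    }
}
```

Neither parser rejects the string, so a check that relies only on "does it parse" will let such URLs through.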

My first thought was that #704 also addresses this issue, but that isn’t the case. The main difference: the logs show no evidence that any attempt is made to fetch these URLs without a host name. The only log messages come from the

  • feed parser (when checking robots.txt before #700):

    2019-04-09 06:19:37.927 c.d.s.p.RobotRulesParser Thread-24-feed-executor[4 7] [INFO] Couldn't get robots.txt for http:/feeds/xml/latest.xml : java.net.UnknownHostException: robots.txt: Name or service not known
    2019-04-09 06:19:37.928 c.d.s.b.FeedParserBolt Thread-24-feed-executor[4 7] [INFO] Feed parser done http://sportdog.gr/feeds/xml/latest.xml

  • and StatusUpdaterBolt with debug logging:

    2019-04-12 09:25:46.379 c.d.s.e.p.StatusUpdaterBolt Thread-16-status-executor[24 24] [DEBUG] Added to waitAck http:/feeds/xml/latest.xml with ID d9b44c50cbf08dc553acaab40fb8d9e58614e655c0738d19f2a481104d405ca2 total 1
    2019-04-12 09:25:46.379 c.d.s.e.p.StatusUpdaterBolt Thread-16-status-executor[24 24] [DEBUG] Sending to ES buffer http:/feeds/xml/latest.xml with ID d9b44c50cbf08dc553acaab40fb8d9e58614e655c0738d19f2a481104d405ca2
    2019-04-12 09:25:46.379 c.d.s.p.AdaptiveScheduler Thread-16-status-executor[24 24] [DEBUG] Scheduling status: DISCOVERED, metadata: discoveryDate: 2019-04-12T09:25:46.379Z

The invalid URL stems from an Atom feed:

    <entry>
        <title>Η Κωνσταντίνα Σπυροπούλου μας δείχνει τα κάλλη της με σέξι μπικίνι - Πρέπει να δεις αυτή τη φωτό!</title>
        <link rel="alternate" type="text/html" href="//#rurl_blhttp://newpost.gr/lifestyle/5caf3cec90e42f7a56d7db7a/i-konstantina-spyropoyloy-mas-deihnei-ta-kalli-tis-me-sexi-mpikini"/>
        <published>2019-04-11T17:09:00+00:00</published>

The mentioned URL is stored in the status index without a routing key (metadata.hostname):

    "hits" : [
      {
        "_index" : "status",
        "_type" : "status",
        "_id" : "d9b44c50cbf08dc553acaab40fb8d9e58614e655c0738d19f2a481104d405ca2",
        "_score" : 0.2876821,
        "_source" : {
          "url" : "http:/feeds/xml/latest.xml",
          "status" : "DISCOVERED",
          "metadata" : {
            "url%2Epath" : [
              "http://sportdog.gr/feeds/xml/latest.xml"
            ],
            "depth" : [
              "1"
            ]
          },
          "nextFetchDate" : "2019-04-12T09:25:46.000Z"
        }
      }
    ]

One solution would be to filter these URLs out: drop a URL if its host name is empty and its protocol is http or https. file: URLs would still be allowed without a host.
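
A rough sketch of such a check in plain Java follows. The class `EmptyHostFilter` and its `filter` method are hypothetical and are not the crawler's actual filter API. Returning null to signal "drop this URL", and dropping strings that java.net.URL cannot parse at all, are assumptions made for this sketch.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.Locale;

// Hypothetical stand-alone filter, not the crawler's actual URL filter class.
public class EmptyHostFilter {

    /**
     * Returns the URL string unchanged if it is acceptable,
     * or null if it should be dropped (assumed convention for this sketch).
     */
    @SuppressWarnings("deprecation") // new URL(String) is deprecated in recent JDKs
    static String filter(String urlString) {
        URL url;
        try {
            url = new URL(urlString);
        } catch (MalformedURLException e) {
            return null; // assumption: unparsable strings are dropped as well
        }
        String protocol = url.getProtocol().toLowerCase(Locale.ROOT);
        boolean isHttp = protocol.equals("http") || protocol.equals("https");
        String host = url.getHost();
        // Only http/https URLs require a host; file: URLs may legitimately have none.
        if (isHttp && (host == null || host.isEmpty())) {
            return null;
        }
        return urlString;
    }

    public static void main(String[] args) {
        String[] samples = {
            "http:/feeds/xml/latest.xml",
            "http://sportdog.gr/feeds/xml/latest.xml",
            "file:/tmp/latest.xml"
        };
        for (String s : samples) {
            System.out.println(s + " -> " + (filter(s) == null ? "dropped" : "kept"));
        }
    }
}
```

The check only looks at whether a host is present, not whether it resolves, so it costs no DNS lookup and can run before a URL is written to the status index.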

Alternatively, the empty host or domain name could be allowed as routing key, so that these items are fetched and fail. Also (I haven’t checked it): isn’t an empty routing key mandatory in a crawl which mixes http:// and file:// URLs?

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Comments: 6 (4 by maintainers)

Top GitHub Comments

sebastian-nagel commented, Apr 15, 2019 (1 reaction)

The cross-domain filter has been disabled and later HostURLFilter has been entirely removed from the config. There can be cross-domain links from feeds, esp. sites “hosting” their feed on feedburner.com.

jnioche commented, Apr 15, 2019 (0 reactions)

makes sense. we could add a check for a valid hostname in the basic URL filter or normaliser
