question-mark
Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging 3rd party libraries. It collects links to all the places you might be looking at while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

Add an `extract` method

See original GitHub issue

One common use-case for cheerio is to extract multiple values from a document, and store them in an object. Doing so manually currently isn’t a great experience. There are several packages built on top of cheerio that improve this: For example https://github.com/matthewmueller/x-ray and https://github.com/IonicaBizau/scrape-it. Commercial scraping providers also allow bulk extractions: https://www.scrapingbee.com/documentation/data-extraction/

We should add an API to make this use-case easier. The API should be along the lines of:

$.extract({
  // To get the `textContent` of an element, we can just supply a selector
  title: "title",

  // If we want to get results for more than a single element, we can use an array
  headings: ["h1, h2, h3, h4, h5, h6"],
  // If an array has more than one child, all of them will be queried for.
  // Complication: This should follow document-order.
  headings: ["h1", "h2", "h3", "h4", "h5", "h6"], // (equivalent to the above)

  // To get a property other than the `textContent`, we can pass a string to `out`. This will be passed to the `prop` method.
  links: [{ selector: "a", out: "href" }],

  // We can map over the received elements to customise the behaviour
  links: [
    {
      selector: "a",
      out(el, key, obj) {
        const $el = $(el);
        return { name: $el.text(), href: $el.prop("href") };
      },
    },
  ],

  // To get nested elements, we can pass a nested extract object. Selectors inside the nested extract object will be relative to the current scope.
  posts: [
    {
      selector: ".post",
      out: {
        title: ".title",
        body: ".body",
        link: { selector: "a :has(> .title)", out: "href" },
      },
    },
  ],

  // We can skip the selector in nested extract objects to reference the current scope.
  links: [
    {
      selector: ".post a",
      out: {
        href: { out: "href" },
        name: { out: "textContent" },
        // Equivalent — get the text content of the current scope
        name: "*",
      },
    },
  ],
});

Issue Analytics

  • State:open
  • Created a year ago
  • Reactions:2
  • Comments:5 (2 by maintainers)

github_iconTop GitHub Comments

1reaction
fb55commented, Dec 15, 2022

This was just merged and a new release hasn’t been issued yet. I’m working through my list for remaining changes, so this hopefully won’t take long.

0reactions
mvasincommented, Dec 15, 2022

Hi and thanks for the great library!

I noticed the extract method in the docs https://cheerio.js.org/interfaces/CheerioAPI.html#extract and in the tests https://github.com/cheeriojs/cheerio/blob/dec7cdc9ad21a1fc5667a2ed015aba9ee3b47e5f/src/api/extract.spec.ts, but when I try to use it:

const $ = cheerio.load('<div>hello</div>')
$.extract({
  div: 'div',
})

cheerio blows up with

$.extract({
  ^

TypeError: $.extract is not a function

I’m using version 1.0.0-rc.12.

Read more comments on GitHub >

github_iconTop Results From Across the Web

Extract Method refactoring | ReSharper Documentation
Press Ctrl+Shift+R and then choose Extract Method. Right-click and choose Refactor | Extract Method from the context menu. Choose ReSharper | ...
Read more >
Extract Method - Refactoring.Guru
Create a new method and name it in a way that makes its purpose self-evident. Copy the relevant code fragment to your new...
Read more >
Extract a method refactoring - Visual Studio - Microsoft Learn
Right-click the code and select Refactor > Extract > Extract Method. Right-click the code, select the Quick Actions and Refactorings menu and ...
Read more >
Extract Method | JustCode Documentation - Telerik
Press Alt+Insert. From the pop-up menu select Extract Method. Enter a name for the new method. Extract Method Created.
Read more >
Extract Method in C# - One Refactoring to Rule Them All
How to perform Extract Method refactoring automatically using Visual Studio 2022 · 1. Select the code lines you want to put in a...
Read more >

github_iconTop Related Medium Post

No results found

github_iconTop Related StackOverflow Question

No results found

github_iconTroubleshoot Live Code

Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start Free

github_iconTop Related Reddit Thread

No results found

github_iconTop Related Hackernoon Post

No results found

github_iconTop Related Tweet

No results found

github_iconTop Related Dev.to Post

No results found

github_iconTop Related Hashnode Post

No results found