Add an `extract` method
One common use-case for cheerio is to extract multiple values from a document and store them in an object. Doing so manually currently isn’t a great experience. Several packages built on top of cheerio improve on this, for example https://github.com/matthewmueller/x-ray and https://github.com/IonicaBizau/scrape-it. Commercial scraping providers also allow bulk extractions: https://www.scrapingbee.com/documentation/data-extraction/
We should add an API to make this use-case easier. The API should be along the lines of:

    $.extract({
      // To get the `textContent` of an element, we can just supply a selector
      title: "title",

      // If we want to get results for more than a single element, we can use an array
      headings: ["h1, h2, h3, h4, h5, h6"],

      // If an array has more than one child, all of them will be queried for.
      // Complication: This should follow document-order.
      headings: ["h1", "h2", "h3", "h4", "h5", "h6"], // (equivalent to the above)

      // To get a property other than the `textContent`, we can pass a string to `out`.
      // This will be passed to the `prop` method.
      links: [{ selector: "a", out: "href" }],

      // We can map over the received elements to customise the behaviour
      links: [
        {
          selector: "a",
          out(el, key, obj) {
            const $el = $(el);
            return { name: $el.text(), href: $el.prop("href") };
          },
        },
      ],

      // To get nested elements, we can pass a nested extract object.
      // Selectors inside the nested extract object will be relative to the current scope.
      posts: [
        {
          selector: ".post",
          out: {
            title: ".title",
            body: ".body",
            link: { selector: "a :has(> .title)", out: "href" },
          },
        },
      ],

      // We can skip the selector in nested extract objects to reference the current scope.
      links: [
        {
          selector: ".post a",
          out: {
            href: { out: "href" },
            name: { out: "textContent" },
            // Equivalent: get the text content of the current scope
            name: "*",
          },
        },
      ],
    });

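The shorthand forms above (bare selector string, array, object with optional `selector` and `out`) could all be reduced to one uniform shape before any querying happens. The following is only a sketch of that idea, not cheerio's implementation: `normalizeQuery`, `normalizeEntry`, `normalizeMap` and the `{ multiple, queries }` result shape are hypothetical names invented for illustration. It assumes, as the proposal suggests, that a missing `out` means `textContent` and a missing `selector` means the current scope.

```javascript
// Hypothetical helpers, not part of cheerio: they only reshape the map and
// never touch a document.

function normalizeQuery(query) {
  // A bare string is a selector whose text content is wanted.
  if (typeof query === "string") {
    return { selector: query, out: "textContent" };
  }
  // An object may omit `selector` (current scope, represented here as null)
  // or `out` (assumed to default to the text content). A function or nested
  // object passed as `out` is kept unchanged.
  return {
    selector: query.selector ?? null,
    out: query.out ?? "textContent",
  };
}

function normalizeEntry(entry) {
  // An array asks for every match; a non-array asks for a single result.
  const multiple = Array.isArray(entry);
  const queries = (multiple ? entry : [entry]).map(normalizeQuery);
  return { multiple, queries };
}

function normalizeMap(map) {
  return Object.fromEntries(
    Object.entries(map).map(([key, entry]) => [key, normalizeEntry(entry)])
  );
}

const plan = normalizeMap({
  title: "title",
  headings: ["h1", "h2"],
  links: [{ selector: "a", out: "href" }],
  href: { out: "href" },
});

console.log(JSON.stringify(plan, null, 2));
```

With a shape like this, the array form with several selectors stays a list of separate queries, so the document-order requirement mentioned in the proposal would still have to be handled when the matches are merged.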
Top GitHub Comments
This was just merged and a new release hasn’t been issued yet. I’m working through my list for remaining changes, so this hopefully won’t take long.
Hi and thanks for the great library!
I noticed the `extract` method in the docs https://cheerio.js.org/interfaces/CheerioAPI.html#extract and in the tests https://github.com/cheeriojs/cheerio/blob/dec7cdc9ad21a1fc5667a2ed015aba9ee3b47e5f/src/api/extract.spec.ts, but when I try to use it, cheerio blows up.
I’m using version `1.0.0-rc.12`.
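Until a release that includes `extract` is installed, calling code could feature-detect the method before using it. This is a minimal sketch under stated assumptions: `supportsExtract` is a hypothetical helper, and the two objects are plain stand-ins for the value returned by cheerio's `load`, so the snippet runs without cheerio installed.

```javascript
// Hypothetical helper: reports whether a loaded API object exposes `extract`.
function supportsExtract(api) {
  return api != null && typeof api.extract === "function";
}

// Stand-ins for a loaded cheerio API (a callable with methods attached).
// These are NOT real cheerio objects; they only mimic the presence or
// absence of the method.
const withoutExtract = Object.assign(() => {}, {});
const withExtract = Object.assign(() => {}, {
  extract() {
    return {};
  },
});

console.log(supportsExtract(withoutExtract)); // false
console.log(supportsExtract(withExtract)); // true
```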