
Plans for Data-Forge version 2

See original GitHub issue

This issue is to discuss plans for version 2.

These are just ideas for the moment. I haven’t started on this yet and am not sure when I will.

Plans for v2:

  • Minimize breaking changes
  • How do I make Data-Forge easier to use and easier to get started with?
    • Lazy evaluation is good for performance, but it makes DF hard to understand, does lazy evaluation need to die?
    • If lazy evaluation were removed, the internals of DF could be massively simplified (getting rid of all the iterables/iterators).
    • If lazy evaluation were removed you could look at a Series or DataFrame to see the current data that’s in there (instead of say, having to call toArray).
    • We could say that splitting so that it fits in memory should happen above DF and is not the responsibility of DF.
  • Move plugins to the same repo (plugins will be republished under the org, e.g. @data-forge/fs)
  • Revise, improve and integrate the documentation (supported by having all the code for plugins in the one repository)
  • Delegate all maths to a pluggable library.
    • This means we can swap between floating-point and decimal maths (for the people who need that)
  • Better support for statistics (e.g. linear regression, correlation, etc). I’m already working through this in v1.
  • Revise and overhaul serialization (e.g. support serialization/deserialization of JavaScript date objects)
    • Better support for mixed data types in columns (serializing the column type doesn’t work for this, might need to serialize per-element type, I like the way MongoDB serializes dates to JSON, “$date”).
  • Investigate replacing iterators with generator functions. (I’ve investigated this now and it doesn’t seem possible.)
  • Add map, filter and reduce functions (this is done now) and deprecate the select and where functions, to make the API more JavaScript-like
  • Support streaming data (e.g. for processing massive CSV files)
    • Ideally DF would be async first and be used to define pipelines for async streaming data, but does async go against the goal of making DF easier to use? Is there a way that I can make it so that async usage is friendly?
    • I’m now thinking that async and parallelisation are higher level concerns that exist above DF and are not DF’s responsibility.
  • Define a format/convention for running transformations (map?) and accumulations (reduce?) over a series / dataframe.
  • It would be great if somehow Series and DataFrame were integrated. After all, a DataFrame is just a Series with columns attached. Having separate Series and DataFrame classes makes for easy-to-browse documentation, but also a lot of duplicated code. If DataFrame could simply derive from Series that would be quite nice, except they have differing functionality. This needs some thought.

Stretch goals:

  • Better performance (using TensorFlow.js?)
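To ground the lazy-evaluation question, here is a minimal sketch in plain JavaScript (this is not Data-Forge’s actual internals, just an illustration): a lazy series only records transformations until toArray forces evaluation, while an eager one holds plain data you can inspect at any time.

```javascript
// Minimal illustration of the lazy-vs-eager trade-off.
// NOT Data-Forge's implementation -- just a sketch of the design question.

// Lazy: map() only records the transformation; nothing runs until toArray().
class LazySeries {
    constructor(iterableFactory) {
        this.iterableFactory = iterableFactory; // a re-creatable iterable
    }
    map(fn) {
        const source = this.iterableFactory;
        return new LazySeries(function* () {
            for (const value of source()) {
                yield fn(value); // work happens here, during iteration
            }
        });
    }
    toArray() {
        return [...this.iterableFactory()]; // evaluation is forced only now
    }
}

// Eager: map() computes immediately, so the data is always inspectable.
class EagerSeries {
    constructor(values) {
        this.values = values; // plain array -- what you see is what's in there
    }
    map(fn) {
        return new EagerSeries(this.values.map(fn));
    }
    toArray() {
        return this.values;
    }
}

const lazy = new LazySeries(function* () { yield 1; yield 2; yield 3; }).map(x => x * 2);
const eager = new EagerSeries([1, 2, 3]).map(x => x * 2);
console.log(lazy.toArray());  // [2, 4, 6] -- computed on demand
console.log(eager.values);    // [2, 4, 6] -- already computed, easy to debug
```

The eager version is what “look at a Series to see the current data” buys you; the lazy version is what the current iterable-based internals buy in deferred work.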
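The per-element “$date” serialization idea could look something like the following. The helper names (serializeValue, deserializeValue) are hypothetical, not Data-Forge API; the sketch just shows how MongoDB-style tagging lets Date objects survive a JSON round-trip even in a mixed-type column.

```javascript
// Sketch of per-element date serialization using MongoDB's "$date" convention.
// serializeValue / deserializeValue are hypothetical helpers, not Data-Forge API.

function serializeValue(value) {
    if (value instanceof Date) {
        return { $date: value.toISOString() }; // tag dates so the type survives JSON
    }
    return value; // numbers, strings, booleans pass through unchanged
}

function deserializeValue(value) {
    if (value && typeof value === "object" && typeof value.$date === "string") {
        return new Date(value.$date); // restore the original Date object
    }
    return value;
}

// A mixed-type column: a single column-level type can't describe this,
// but per-element tagging handles it.
const column = [42, "hello", new Date("2021-02-13T00:00:00.000Z")];
const json = JSON.stringify(column.map(serializeValue));
const restored = JSON.parse(json).map(deserializeValue);
console.log(restored[2] instanceof Date); // true -- the Date round-tripped
```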

Issue Analytics

  • State: open
  • Created: 3 years ago
  • Reactions: 4
  • Comments: 16 (8 by maintainers)

Top GitHub Comments

2 reactions
rat-matheson commented, Feb 17, 2021

Some great ideas there! I’m not well versed enough in Data-Forge to give strong suggestions, but I have written a couple of pipe/stream libraries in the past and have some thoughts.

  1. RE: pluggable library - does this mean using something like N-API to facilitate faster operations in C++? If it were possible to do performant vector and matrix math with Data-Forge, it would become a real contender, so that folks don’t have to learn Python or R. In conjunction with support for large datasets, this would be a huge success.

  2. API design - a minimal library that is discoverable via IntelliSense and yet extendable without having to build a special version is ideal. A really great way to do this might be to keep the Data-Forge core extremely minimal and then have a small number of apply/transform/map functions to carry out plugin operations. I know from your comments above that you are already thinking about how to do this. I’m curious how you imagine it, but here are some thoughts I have.

Example:

// DF today
let series = myDf.getSeries('someCol');
return <SummaryStats>{
    // I haven't looked at the implementation but I imagine each of these series min/max has to evaluate
    // over the entire collection (meaning loading it into memory).  The future idea below loads in memory just once
    // and could easily be streamed so that it is only partially loaded
    min: series.min(),
    max: series.max(),
    ...
}

// The issue with the above is that 'series' needs to know all the statistical functions I want as a user 
// and have an implementation for them.  In addition, each call is a separate evaluation across the entire dataset

//Future idea??
import { SummaryStatsFactory } from 'simple-statistics/data-forge';
import { getXPercentile } from './custom-operations/getXPercentile';

myDf.getSeries('someCol')
    // I'm imagining that once summarizeRows is finally evaluated, it returns a new DataFrame with the summaries as columns
    // And that I may want to join that DF with something else
    .summarizeRows([
        // So still discoverable but plugins can be external.  Could write wrappers to populate packages so that
        // users don't need to learn new interfaces
        SummaryStatsFactory.getAverage({name:'SomeColAverage'}), 
        SummaryStatsFactory.getMinimum(),

        // add a custom function
        getXPercentile({ percentile: 0.8, name: 'SomeCol80thPercentile' }),
        ...
    ])

    // I added this additional summarizeRows call to illustrate that it could be lazily evaluated.  It could return an interface that
    // keeps taking summary operations until something forces an evaluation 
    .summarizeRows(SummaryStatsFactory.getMaximum());

There’s a lot to take away there, but the main point I want to drive home is to keep the core DataFrame and Series interfaces as simple as possible, and then use external libraries to do the manipulations.

  • Easier to learn Data-Forge, because you just need to learn the key functions to get started (the challenge is determining which functions are absolutely key)
  • Easy to extend (basically just export a function or group some together)
  • Related operations can be grouped into factories, like an ML factory, a stats factory, an IO factory, etc.
  • Can use adapters for existing libraries so that users don’t have to learn a whole new set of functions…almost like @types/…
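To make the plugin idea a little more concrete: assuming summaries were plain reducer objects (none of the names below are real Data-Forge or simple-statistics API), the core could run every registered summary in a single pass over the data, which also keeps the design stream-friendly. This is a rough sketch under those assumptions:

```javascript
// Rough sketch of the pluggable-summary idea: each summary is a plain
// reducer object, and all summaries run in ONE pass over the data.
// None of these names are real Data-Forge API.

const min = (name = "min") => ({
    name,
    init: () => Infinity,
    step: (acc, x) => Math.min(acc, x),
});

const max = (name = "max") => ({
    name,
    init: () => -Infinity,
    step: (acc, x) => Math.max(acc, x),
});

// Custom summaries are just more objects -- no changes to the core needed.
const mean = (name = "mean") => ({
    name,
    init: () => ({ sum: 0, n: 0 }),
    step: (acc, x) => ({ sum: acc.sum + x, n: acc.n + 1 }),
    result: acc => acc.sum / acc.n, // optional finalizer
});

function summarize(values, summaries) {
    const accs = summaries.map(s => s.init());
    for (const x of values) {                 // one pass, stream-friendly
        for (let i = 0; i < summaries.length; i++) {
            accs[i] = summaries[i].step(accs[i], x);
        }
    }
    const out = {};
    summaries.forEach((s, i) => {
        out[s.name] = s.result ? s.result(accs[i]) : accs[i];
    });
    return out;
}

console.log(summarize([3, 1, 4, 1, 5], [min(), max(), mean()]));
// { min: 1, max: 5, mean: 2.8 }
```

The single pass addresses the comment’s concern about each of series.min()/series.max() being a separate evaluation across the entire dataset.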

The naming I chose was poor. “Pipe” seems better, but Data-Forge has a few kinds of piping operations (such as summarizing and grouping), which makes it complicated.

  3. Marketing/project coordination - I’m pretty blown away by DF and that someone out there has written a book on working with data in JavaScript. In the JS sphere, you must be near the top in terms of credentials for pushing a JavaScript stack for data exploration and manipulation. There is a void in the JS community, and we need a good opinionated push showing data loading, exploration, and ML in JS along with good performance (and the performance part doesn’t seem possible…yet). You have the credentials to bring a group of people together and drive a coordinated effort to make JS a data science contender. I bet funding could even be possible if a good case were made.

I’m pretty excited about that possibility. Imagine if DF 2 were comparable to pandas in terms of performance. There are more JS programmers than Python programmers, and R is a true mess for readability. JS is better for visualization given its close connection to HTML. Plus, the transition from exploration to production might be easier in JS than in Python, and definitely easier compared to R, MATLAB, etc.

  4. RE: grouping - I haven’t used Data-Forge enough to know if this is possible, but my streaming library had both group and ungroup functions. This was super useful for grouping some things, doing some work on the group, and then going back to regular operations across the whole set. I’ll take a look at my API at some point and see how it relates to DF.
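The group / work-on-group / ungroup pattern described here can be sketched in plain JavaScript. (Data-Forge v1 does have a groupBy; the groupBy helper below and the idea of “ungroup” as flattening back to a flat row set are just illustrative, not DF’s API.)

```javascript
// Sketch of group -> per-group work -> ungroup, in plain JavaScript.
// groupBy here is a hypothetical helper, not Data-Forge's implementation.

function groupBy(rows, keyFn) {
    const groups = new Map();
    for (const row of rows) {
        const key = keyFn(row);
        if (!groups.has(key)) groups.set(key, []);
        groups.get(key).push(row);
    }
    return groups;
}

const rows = [
    { city: "Brisbane", temp: 28 },
    { city: "Brisbane", temp: 31 },
    { city: "Hobart", temp: 12 },
];

// Group, do some work on each group (attach the group mean), then
// "ungroup" back to a flat row set to continue with regular operations.
const flattened = [...groupBy(rows, r => r.city).values()]
    .map(group => {
        const avg = group.reduce((s, r) => s + r.temp, 0) / group.length;
        return group.map(r => ({ ...r, cityAvg: avg })); // per-group work
    })
    .flat(); // ungroup: back to one flat array of rows

console.log(flattened);
// Brisbane rows carry cityAvg 29.5; the Hobart row carries cityAvg 12
```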

Added an example for future discussion in a separate thread here.

1 reaction
nemosmithasf commented, Feb 15, 2021

Adding my voice towards better support for large datasets
