Plans for Data-Forge version 2
This issue is to discuss plans for version 2.
These are just ideas for the moment. I haven't started on this and am not sure when I will.
Plans for v2:
- Minimize breaking changes.
- How do I make Data-Forge easy to use and easier to get started with?
- Lazy evaluation is good for performance, but it makes DF hard to understand. Does lazy evaluation need to die?
- If lazy evaluation were removed, the internals of DF could be massively simplified (getting rid of all the iterables/iterators).
- If lazy evaluation were removed, you could look at a Series or DataFrame to see the current data that's in there (instead of, say, having to call toArray; see the sketch below).
- We could say that splitting data so that it fits in memory should happen above DF and is not the responsibility of DF.
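To make the lazy-versus-eager trade-off concrete, here is a minimal sketch against the v1 API (select and toArray exist today; the eager v2 behaviour is hypothetical):

```typescript
import { Series } from "data-forge";

// v1 today: select is lazy, so nothing is computed until the
// result is materialized, e.g. by calling toArray.
const doubled = new Series([1, 2, 3]).select(x => x * 2);
console.log(doubled.toArray()); // [2, 4, 6]

// In a hypothetical eager v2, select would evaluate immediately,
// so logging or debugging the Series would show [2, 4, 6]
// directly, with no toArray call required.
```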
- Move plugins to the same repo (plugins will be republished under the org, e.g. @data-forge/fs).
- Revise, improve and integrate the documentation (supported by having all the code for plugins in the one repository).
- Delegate all maths to a pluggable library.
- This means we can swap between floating-point and decimal maths (for the people who need that).
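A minimal sketch of what a pluggable maths provider might look like; the MathProvider interface and floatMath object are hypothetical illustrations, not a proposed v2 API:

```typescript
// Hypothetical: DF core routes all arithmetic through a provider
// instead of using JavaScript's operators directly.
interface MathProvider<T> {
    add(a: T, b: T): T;
    subtract(a: T, b: T): T;
    multiply(a: T, b: T): T;
    divide(a: T, b: T): T;
}

// Default provider: plain IEEE-754 floating point.
const floatMath: MathProvider<number> = {
    add: (a, b) => a + b,
    subtract: (a, b) => a - b,
    multiply: (a, b) => a * b,
    divide: (a, b) => a / b,
};

// A decimal provider could implement the same interface on top of
// a library like decimal.js, trading speed for exactness.
```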
- Better support for statistics (e.g. linear regression, correlation, etc). I'm already working through this in v1.
- Revise and overhaul serialization (e.g. support serialization/deserialization of JavaScript Date objects).
- Better support for mixed data types in columns (serializing the column type doesn't work for this; might need to serialize a per-element type. I like the way MongoDB serializes dates to JSON: "$date").
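As an illustration of per-element type tagging, here is a minimal sketch that borrows MongoDB's "$date" convention; the function names are hypothetical and this is not a committed v2 format:

```typescript
// Tag dates on the way out, MongoDB-style.
function serializeValue(value: unknown): unknown {
    return value instanceof Date
        ? { $date: value.toISOString() }
        : value;
}

// Recognize the tag on the way back in.
function deserializeValue(value: unknown): unknown {
    if (value !== null && typeof value === "object" && "$date" in value) {
        return new Date((value as { $date: string }).$date);
    }
    return value;
}

// Round trip: the Date survives JSON serialization.
const json = JSON.stringify(serializeValue(new Date("2021-02-13")));
console.log(deserializeValue(JSON.parse(json))); // restored Date object
```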
- Investigate replacing iterators with generator functions. I've investigated this now and it doesn't seem possible.
- Add map, filter and reduce functions (this is done now) and deprecate the select and where functions (make it more JavaScript-like).
- Support streaming data (e.g. for processing massive CSV files). Ideally DF would be async-first and be used to define pipelines for async streaming data, but does async go against the goal of making DF easier to use? Is there a way that I can make async usage friendly?
- I'm now thinking that async and parallelisation are higher-level concerns that exist above DF and are not DF's responsibility.
- Define a format/convention for running transformations (map?) and accumulations (reduce?) over a Series / DataFrame.
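Since map, filter and reduce are already in v1, the convention could simply mirror JavaScript arrays. A small sketch, assuming Array-style signatures (the exact v1 signatures may differ):

```typescript
import { Series } from "data-forge";

const total = new Series([1, 2, 3, 4, 5])
    .map(x => x * 2)                  // transformation (replaces select)
    .filter(x => x > 4)               // predicate (replaces where)
    .reduce((acc, x) => acc + x, 0);  // accumulation

console.log(total); // 24, i.e. 6 + 8 + 10
```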
- It would be great if somehow Series and DataFrame were integrated. After all, a DataFrame is just a Series with columns attached. Having separate Series and DataFrame is good for easy-to-browse documentation, but it makes for a lot of duplicated code. If DataFrame could just derive from Series, that would be quite nice, except they have differing functionality. This needs some thought.

Stretch goals:
- Better performance (using TensorFlow.js???)
Top GitHub Comments
Some great ideas there! I'm not well versed enough in DataForge to give strong suggestions, but I have written a couple of pipe/stream libraries in the past and have some thoughts.
RE: pluggable library - does this mean using something like N-API to facilitate faster operations in C++? If it were possible to do performant vector and matrix math with Data-Forge, it would become a real contender so that folks don't have to learn Python or R. In conjunction with support for large datasets, this would be a huge success.
API design - A minimal library that is discoverable via intellisense and yet extendable without having to build a special version is ideal. A really great way to do this might be to keep the DataForge core extremely minimal and then have a small number of apply/transform/map functions to carry out plugin operations. I know from your comments above that you are already thinking about how to do this. I'm curious as to how you imagine it, but here are some thoughts I have.
Example:
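A rough sketch of that idea: the core exposes only Series/DataFrame, while operations such as a (hypothetical) zscore live in external packages as plain functions:

```typescript
import { Series } from "data-forge";

// Hypothetical plugin: a statistics operation that lives outside
// the core, as a plain function from Series to Series.
function zscore(series: Series<number, number>): Series<number, number> {
    const values = series.toArray();
    const mean = values.reduce((a, b) => a + b, 0) / values.length;
    const std = Math.sqrt(
        values.reduce((a, b) => a + (b - mean) ** 2, 0) / values.length
    );
    return series.select(v => (v - mean) / std);
}

// The core stays minimal: it only needs one generic way to hand
// itself to such functions (a pipe/apply method, or plain calls).
const normalized = zscore(new Series([1, 2, 3, 4]));
console.log(normalized.toArray());
```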
There's a lot to take away there, but the main point I want to drive home is to keep the core DataFrame and Series interface as simple as possible, and then use external libraries to do the manipulations.
The naming I chose was poor. Pipe seems better, but DataForge has a few kinds of piping operations (such as summarizing and grouping), which makes it complicated.
I'm pretty excited about that possibility. Imagine if DF 2 were comparable to pandas in terms of performance. There are more JS programmers than Python programmers, and R is a true mess for readability. JS is better for visualization given its close connection to HTML. Plus, the transition from exploration to production might be easier in JS than in Python, and definitely easier compared to R, MATLAB, etc.
Added an example for future discussion in a separate thread here.
Adding my voice towards better support for large datasets