Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging 3rd party libraries. It collects links to all the places you might be looking at while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

Wanted: Strawman proposals for new collections architecture

See original GitHub issue

A redesign of the standard library is on the roadmap for one of the next Scala versions (could be as early as 2.13). Some requirements for the redesign should be:

Normal usage of collections should stay the same. That is, the standard operations on collections should still be available under the same names. Some more exotic and advanced scenarios such as breakOut can be dropped if alternative ways to achieve the same functionality exist.
User-defined implementations of collections should port to the new library as far as is reasonable. We should allow some breakage here, if it is necessary to achieve other goals.
We should strive for simplicity in APIs and implementations.
We should strive to better separate interfaces from implementation and avoid the fragile base class problem, where too much gets inherited automatically.
We should try to simplify the inheritance graphs, in particular those of often-used collections such as lists.
We should improve the integration of strict and lazy collections by means of a better architecture for views. Views should avoid accidental forcing, and should omit transforms from collections that require such forcing. However, forcing is still needed to support aggregations.
We should try to integrate Java8 streams and/or have our own high-performance implementations for parallel collections.
We should generally be at least on par with current collections in what concerns efficiency. In particular, we should still allow specializations of collection operations for particular implementations. These optimizations should still work if the static collection type is abstract. E.g. an optimized implementation of ++ on ArrayBuffers should be called even if the static types of the operands of ++ are Iterables.
The design should be friendly to optimizations such as inlining and, possibly, more advanced whole program optimizations.

To gain experience it would be good to experiment with some prototypes that illustrate possible designs and can point to strengths and weaknesses. A full collections design is too heavy to fill this role. It takes too long to write and understand. As an alternative, I propose to implement just a subset of the required functionality. The subset should implementable in about 500 LOC or less and at the same time should be as representative as possible of the whole collections. The API should be the same for all strawman proposals.

After some experimentation, I came up with the following proposal for the API to be implemented by strawman proposals. There’s still some modest room to add things if people feel it is important.

Base Traits

Iterator
Iterable
Seq

For now we leave out maps and sets (which does not say they are not important). We gloss over the mutabile/immutable distinction. All collections are in the same package (this is only for the strawman, to keep things simple, not for the full collections, of course).

Iterable and Seq are considered collection types, but Iterator is not.

Collection Implementations

List                    
ListBuffer   
ArrayBuffer          
java.lang.String  
View

List is a simplified version of Scala lists. It’s an immutable collection favoring linear access patterns. List operations should be tail-recursive, hence the addition of ListBuffer to hold intermediate results. ArrayBuffer is a mutable collection that supports random access. String should demonstrate how the architecture integrates externally defined types. String should be seen by Scala as an immutable random access sequence of Char. Finally, views should be constructible over all other collections. They are immutable and lazy.

List, ListBuffer and ArrayBuffer should have companion objects that let one construct collections given their elements in the usual way:

List(1, 2, 3)
ListBuffer("a", "bc")
ArrayBuffer()

Collection operations

The following operations should be supported by all collections.

foldLeft
foldRight
isEmpty
head
iterator
view
to

foldLeft and foldRight are the basis of all sorts of aggregations. isEmpty and head exemplify tests and element projections. iterator and view construct iterators and views over a collection. collect is not part of the current Scala collections, but is useful to support views and Java 8 streams. It should construct a new collection of a given type from the elements of the receiver. An invocation pattern of to could be either of the following:

xs.to[List]
xs.to(List)

Strawman collections also should support the following transforms:

filter
map
flatMap
++
zip
partition
drop

map and flatMap together support all monadic operations. ++ and zip are examples of operations with multiple inputs. drop was chosen as a typical projection that yields a sub-collection. partition was chosen in addition filter because it exemplifies transforms with multiple outputs.

Strawman Seqs (but not necessarily Iterables or Views) should also support the following operations

length
apply
indexWhere
reverse

Sequences in a way are defined by their length and their indexing method apply. indexWhere is an example of a method that combines indices and sequences in an interesting way. reverse is an example of a transform that suggests an access pattern different from left-to-right.

ArrayBuffer and ListBuffer should in addition support the following mutation operations

+=
++=
trimStart

Required Optimizations

There are a number of possible specializations that the strawman architecture should support. Here are two examples:

xs.drop(n) on a list xs should take time proportional to n, the retained part of the list should not be copied.
xs ++ ys on array buffers xs and ys should be implemented by efficient array copy operations. s1 ++ s2 on strings should use native string concatenation.
partition should require only a single collection traversal if the underlying collection is strict (i.e., not a view).

These specializations should be performed independently of the static types of their arguments.

Why No Arrays?

Collections definitely should support arrays as they do now. In particular, arrays should have the same representation as in Java and all collection operations should be applicable to them. Arrays are left out of the strawman requirements because of the bulk their implementation would add. Even though no explicit implementation is demanded, we should still check all designs for how they would support arrays.

Issue Analytics

State:
Created 8 years ago
Comments:66 (28 by maintainers)

Top GitHub Comments

9reactions

SolomonSun2010commented, May 26, 2017

I really expect LazyList or Stream add these powerful and convenient methods : repeat/cycle, repeatedly/generate, iterate, unfold, resource, reject, like in Elixir and Clojure: https://hexdocs.pm/elixir/Stream.html#unfold/2 This symbol #:: is too low,strange to verbose,unreadable. Also, I hope add some xxxIndexed like in Kotlin Sequence: https://kotlinlang.org/api/latest/jvm/stdlib/kotlin.sequences/index.html I think, one of the programming trend is the Lazy Sequences .

7reactions

balefrostcommented, Nov 16, 2015

As somebody who writes C# at work and Scala at home, there are some conveniences in the .NET collection API that I miss in Scala. I realize that most of the discussion here is related to the implementation details of the collections library, whereas my comments are more concerned with the user API of the collections library. Still, maybe these thoughts will be useful.

Suppose I have a list of pairs (to be interpreted as key/value pairs). I’d like to group the values by their corresponding keys, with the understanding that there may be duplicate keys. In Scala, I would do that like this (unless there’s a simpler way that I’m not seeing):

pairs.groupBy(_._1).mapValues(_.map(_._2))

Or, to be more verbose but perhaps also more efficient, I might do this:

pairs.foldLeft(Map.empty[K, Set[V]]) {
    case (acc, (k, v)) =>
        acc + (k -> acc.getOrElse(k, Nil) :+ v)
}

Whereas in C#, I could use an overload of GroupBy do this:

pairs.GroupBy(x => x.Item1, x => x.Item2)

The .Net library realizes that simultaneous mapping while grouping is a common use case, so they provide a way to do that. The C# version feels far more ergonomic to me than either Scala snippet.

Another nice feature of the .Net collection API is that all methods that iterate the collection with a callback have overloads whose callback function takes the current index. So while I would need to do this in Scala:

coll.zipWithIndex.map {
    case (el, idx) =>
        el * idx
}

I could do this in C#

coll.Select((el, i) => el * idx)

It’s easy enough for Select, SelectMany, and friends to also track the index while iterating, so they provide overloads exposing that functionality.

The .Net collection API includes many such conveniences. It’s clear that the designers put a lot of thought into it. And though having methods take multiple function parameters seems like a questionable idea, in practice, it works out surprisingly well.

I realize that I’m arguing for convenience, which goes somewhat against goal #3, above. I also realize that the sort of convenience methods that I’m talking about could be implemented by a third-party library. But I believe that the Scala collection library would be better for having these conveniences built-in.