Proposal: drop Scala backend implementation in favor of zipkin-java

The zipkin-java project was created out of issues #451 and #463. It is now approaching version 1.0, which has raised questions about the future of the Scala implementation. This issue enumerates opportunities, concerns and any actions to take on this topic.

Pro

zipkin-java has more features
- It has a no-dependency library, usable for example in instrumentation projects, custom collectors or Spark
- fine-grained logging of dependency linking and collection stages
- fine-grained, and tested health check system
- collector-metrics broken out by transport
- integration tests that invoke the packaged exec jar
- elasticsearch storage
Maintainers can do more, if only supporting one server framework.
- Currently, there’s a lot of distracting work, porting code and tests, and releasing both projects
- Docker and documentation are confusing when there is more than one authoritative process.
- By cutting Scala, Zipkin’s few maintainers can make more progress in the same time, and cause less confusion.
The java build is more stable due to fewer dependencies
- The scala project uses more dependencies including the unreliable maven.twttr.com repository which often breaks the build with http 503 errors.
Java implementation is made from better understood technology
- the server is based on spring-boot, which is a popular framework and well understood from a configuration POV.
- the scala process is hosted on a relatively rare framework, and literally loads .scala files to configure the process!
Java implementation can be deployed as a single process
- while not required, you can enable all features including kafka in the same process
Java implementation has more people participating in difficult topics
- Topics including modularity, dependency management and configuration have had fast progress in the java project.
- Elasticsearch storage was written in <1 week and worked both in the stock server and in one based on Armeria, a Netty RPC service. This is in contrast to years of requests in Scala with no action.

Con

No precanned profiles like “collector”
- Some may want a profile like “collector”, which includes the ability to receive but not query spans.
Scala implementation battle-hardened at Twitter. Performance, availability implications.
- A lot of Zipkin code isn’t used at Twitter (e.g. Cassandra, Kafka), but Finagle + Scribe is.
- Those already using zipkin collectors may have higher confidence in Finagle + Scribe vs FB swift (used in zipkin-java)
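
To make the “collector” profile idea above concrete, here is a minimal sketch that treats a profile as a named set of enabled components. The component and profile names are hypothetical; neither server is known to define profiles this way.

```python
# Hypothetical component names, used only for illustration.
ALL_COMPONENTS = frozenset({"collector-http", "collector-kafka", "query-api", "ui"})

# A profile is just a named subset of the components a single process can host.
PROFILES = {
    "all-in-one": ALL_COMPONENTS,
    "collector": frozenset({"collector-http", "collector-kafka"}),
    "query": frozenset({"query-api", "ui"}),
}


def enabled_components(profile: str) -> frozenset:
    """Return the components a profile turns on, or fail loudly if it is unknown."""
    try:
        return PROFILES[profile]
    except KeyError:
        raise ValueError(f"unknown profile: {profile!r}") from None


def can_query(profile: str) -> bool:
    """True if a process started with this profile would serve query requests."""
    return "query-api" in enabled_components(profile)
```

With this shape, a “collector” process receives spans but `can_query("collector")` is false, while the all-in-one process keeps every feature enabled.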

Top GitHub Comments

codefromthecrypt commented, Mar 30, 2016

These are valid concerns, though I’d like to mention a few things that not everyone may know. They affect the perception of supportability:

- zipkin’s scala code has had numerous significant bugs, yet very few have contributed fixes to it.
- very little of the scala code is actually used inside Twitter. For example, twitter doesn’t use cassandra.
- there is little evidence of heavy tuning in the source history of the scala code
- benchmarks in scala that have happened have been ad-hoc. For example, pyramid-zipkin adjusting parameters for kafka (these parameters exist in both java and scala projects)

The scala+finagle aspect has actually led directly to people not participating in the project, and also delayed features for extreme amounts of time. For example, elasticsearch was requested years ago, and had some false-starts. Elasticsearch was implemented in java in less than a week by someone who formerly had no experience with zipkin. There’s a lot of evidence we can dig up about this, and this was a primary motivation for the java project itself.

While both the scala and java processes can enable a feature like kafka, there’s no reason to believe that either choice means a resilience failure. Before, we had no choice for an all-in-one process: that significantly damaged the ability for people to learn zipkin quickly. In other words, let’s not mistake choice for a topology mandate. It is already the case that folks have collectors in golang and whatnot. There’s nothing preventing someone from making a zipkin server that is only a collector. In fact, there already is one in spring-cloud-sleuth for rabbit.

zipkin-java had a scribe transport, but it was deleted because scribe is archived and folks involved in zipkin had bad experiences with scribe. We can choose to re-enable that feature in java using facebook swift (history lesson: facebook made scribe and thrift, so this choice shouldn’t be scary).

Regardless of whether folks want a scala+finagle thing or not, it is important that zipkin lives. Right now, I spend a great deal of time on undifferentiated work in scala even though I get lots of help in java. There is a burden to maintaining zipkin, and with the number of people we have, it surely suffers from having to carry excess weight in relatively unknown frameworks.

eirslett commented, Mar 29, 2016

Java implementation has lower complexity in terms of technology used (TBD: how exactly?)

Java is a more widely known programming language than Scala, and the Spring framework/stack is more widely known than the Finagle/Twitter framework/stack. The technical complexity is about the same (I think), but more people know Java than Scala.

No built-in Scribe collector support, breaking feature parity

I think we should add scribe support to zipkin-java. The java port should be a drop-in replacement for the scala implementation, so we don’t need to make any changes to the instrumented applications.

Multi-process Zipkin provides extra resiliency

They [collector/query] operate independently, such that a degradation in one process does not affect the ability of the other to serve requests.

This could be solved in the infrastructure, by only sending query traffic to some instances, and only sending collector traffic to other instances. I’m not concerned about query workloads taking down a collector (it’s very low volume), but a collector which is receiving too many spans could possibly crash, and then the query functionality wouldn’t work either. So for extra resiliency, you could deploy one query-only instance that is guaranteed to answer, even when the collectors are flooded. (Unless the storage backend is also kneeling)
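
As a rough illustration of splitting traffic in the infrastructure, the sketch below classifies requests and picks an instance from a separate pool for each kind. It assumes spans are reported with `POST /api/v1/spans` and that everything else is query or UI traffic; the host names and the `Router` class are hypothetical.

```python
from itertools import cycle

# Hypothetical instance pools; the host names are placeholders.
COLLECTOR_POOL = ["zipkin-collect-1:9411", "zipkin-collect-2:9411"]
QUERY_POOL = ["zipkin-query-1:9411"]


def pick_pool(method: str, path: str) -> str:
    """Classify a request as 'collector' or 'query' traffic."""
    # Assumption: span reporting is POST /api/v1/spans; anything else is a query.
    if method.upper() == "POST" and path.rstrip("/") == "/api/v1/spans":
        return "collector"
    return "query"


class Router:
    """Round-robins within each pool so a collector flood never lands on query instances."""

    def __init__(self, collectors, queries):
        self._pools = {"collector": cycle(collectors), "query": cycle(queries)}

    def route(self, method: str, path: str) -> str:
        return next(self._pools[pick_pool(method, path)])
```

In practice this decision would live in a load balancer or reverse proxy rule rather than application code; the point is only that identical server processes can be dedicated to one role by routing alone.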

How do we know that an http-backed Zipkin tracer is as performant as Finagle’s scribe-backed Zipkin tracer?

We should write a couple of load tests:

- finagle -> scribe -> zipkin-scala
- finagle -> http -> zipkin-scala (if this is supported?)
- finagle -> scribe -> zipkin-java
- finagle -> http -> zipkin-java

and see how many spans per second we can process. Also, measure how latency in the storage backend affects server performance, and how much traffic the collector can take before it crashes.
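
One possible shape for such a load test is a small harness that pushes batches through a pluggable sender for a fixed duration and reports spans per second. This is a sketch under assumptions: `send_batch` is a hypothetical callable that each pipeline (scribe or http, scala or java) would implement, and the injectable `clock` exists only so the harness can be exercised without a real server.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class LoadResult:
    spans_sent: int
    spans_failed: int
    elapsed_s: float

    @property
    def spans_per_second(self) -> float:
        # Guard against a zero-length run.
        return self.spans_sent / self.elapsed_s if self.elapsed_s > 0 else 0.0


def run_load(
    send_batch: Callable[[int], bool],
    batch_size: int,
    duration_s: float,
    clock: Callable[[], float] = time.monotonic,
) -> LoadResult:
    """Call send_batch repeatedly until duration_s has passed, counting accepted spans."""
    start = clock()
    sent = failed = 0
    elapsed = 0.0
    while elapsed < duration_s:
        if send_batch(batch_size):
            sent += batch_size
        else:
            failed += batch_size  # the server rejected or dropped the batch
        elapsed = clock() - start
    return LoadResult(sent, failed, elapsed)


if __name__ == "__main__":
    # Stub sender that always succeeds; replace with a real transport client.
    result = run_load(lambda n: True, batch_size=100, duration_s=0.1)
    print(f"{result.spans_per_second:.0f} spans/s, {result.spans_failed} failed")
```

Running the same harness against each of the four pipelines, and again with artificial delay added to the storage backend, would give comparable numbers for the questions raised above.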

moving the fundamental infra away from [~~scribe and~~] scala isn’t inherently problematic as much as it is labor intensive.

Given that Zipkin is distributed as a shaded jar, Scala is already an implementation detail from the infrastructure point of view; it’s just a JVM process, or even just a Docker container. But given that Twitter builds Zipkin from source for Manhattan support (is that still true?), there would be a rewrite effort, since the APIs would be Java, not Scala.