[Internal] Client Telemetry Public API Proposal

Background

As per the latest discussion, to align with the Java API and avoid confusion over the feature name, we decided to use “Client Telemetry” as the parent feature, under which two kinds of telemetry are covered as of now:

1. Send Client Diagnostic Metrics To Service: sending telemetry metrics to Microsoft.

Read more about it here: https://github.com/Azure/azure-cosmos-dotnet-v3/blob/master/docs/observability.md#send-telemetry-from-sdk-to-service-private-preview

2. Distributed Tracing: sending activities with operation-level or network-level information to the customer’s APM tool, such as AppInsights (metrics may also be covered in the future).

Read more about it here: https://github.com/Azure/azure-cosmos-dotnet-v3/blob/master/docs/observability.md#distributed-tracing-preview

We are fine if the Java API and the .NET API don’t look exactly alike. The idea here is to make sure the API does not confuse customers.

Java Public API looks like this:

CosmosClientBuilder image

CosmosClientTelemetryConfig image

CosmosDiagnosticsThresholds image

Proposed public API in the .NET SDK to control the above features:

By default, both of the above features will be enabled in the SDK.

Why?

1. Send Client Diagnostic Metrics To Service: This feature is controlled from the portal. Enabling it in the SDK by default gives customers a seamless experience: they can enable it from the portal and we see data flowing to us without any code change. If it is disabled on the portal and enabled in the SDK, there is no perf impact or overhead on the SDK, because the portal setting has higher priority.

2. Distributed Tracing: It works on a subscription model: unless there is a subscriber, it won’t generate any activity, hence no overhead. This feature flag is introduced to support the AppInsights SDK, as customers cannot unsubscribe from an activity source there.
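The subscription model mentioned for distributed tracing can be observed with the standard `System.Diagnostics` types. The following is a minimal sketch, not the SDK's implementation: the source name `Sketch.Cosmos.Operation` and the class `SubscriptionModelSketch` are made up for illustration, and only `ActivitySource`, `ActivityListener` and `Activity` are real .NET APIs.

```csharp
using System.Diagnostics;

public static class SubscriptionModelSketch
{
    // Hypothetical source name, used only in this sketch.
    private const string SourceName = "Sketch.Cosmos.Operation";

    private static readonly ActivitySource Source = new ActivitySource(SourceName);

    // Returns true when an activity was actually created for the operation.
    public static bool TryTraceOperation(string operationName)
    {
        // StartActivity returns null when no listener is subscribed to the source,
        // so an unobserved operation creates no Activity object at all.
        using var activity = Source.StartActivity(operationName);
        return activity != null;
    }

    // Subscribes a listener that samples every activity from the sketch source.
    // Disposing the returned listener removes the subscription again.
    public static ActivityListener Subscribe()
    {
        var listener = new ActivityListener
        {
            ShouldListenTo = source => source.Name == SourceName,
            Sample = (ref ActivityCreationOptions<ActivityContext> options) =>
                ActivitySamplingResult.AllData,
        };
        ActivitySource.AddActivityListener(listener);
        return listener;
    }
}
```

Calling `TryTraceOperation` before `Subscribe()` yields no activity; after subscribing it yields one. A flag such as the proposed `DisableDistributedTracing()` matters in the opposite situation, where a listener is attached by tooling the customer cannot opt out of.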

Want to see distributed tracing in action?

OpenTelemetry way: https://github.com/Azure/azure-cosmos-dotnet-v3/tree/master/Microsoft.Azure.Cosmos.Samples/Usage/OpenTelemetry

AppInsights SDK way: https://github.com/Azure/azure-cosmos-dotnet-v3/tree/master/Microsoft.Azure.Cosmos.Samples/Usage/ApplicationInsights

In both cases, there are scenarios where a customer would want to disable an individual telemetry feature from the SDK.

Public APIs

CosmosClientBuilder

| Class Name | Method Name | Return Type | Comment |
| --- | --- | --- | --- |
| CosmosClientBuilder | WithClientTelemetryOptions(CosmosClientTelemetryOptions options) | Returns the client telemetry config instance for this builder | Single function to control/configure any kind of telemetry |

CosmosClientOptions

| Class Name | Method Name | Return Type | Comment |
| --- | --- | --- | --- |
| CosmosClientOptions | CosmosClientTelemetryOptions | Set ClientTelemetry Options | Option to control/configure any kind of telemetry |

CosmosClientTelemetryOptions

| Class Name | Method Name | Return Type | Comment |
| --- | --- | --- | --- |
| CosmosClientTelemetryOptions | DisableSendingMetricsToService() | void | Disable sending telemetry data to the service, i.e. Microsoft |
| CosmosClientTelemetryOptions | DisableDistributedTracing() (PREVIEW) | void | Disable the Distributed Tracing feature; it will stop generating activities even if there are subscribers |
| CosmosClientTelemetryOptions | CosmosThresholdOptions(CosmosThresholdOptions options) (PREVIEW) | void | Options to configure thresholds for distributed tracing (later we can use this same config for OpenTelemetry metrics, similar to Java) |

CosmosThresholdOptions

| Class Name | Method Name | Return Type | Comment |
| --- | --- | --- | --- |
| CosmosThresholdOptions | NonPointOperationLatencyThreshold(TimeSpan span) (PREVIEW) | void | Sets the latency above which a LatencyOverThreshold event with the diagnostic string is created for non-point operations |
| CosmosThresholdOptions | PointOperationLatencyThreshold(TimeSpan span) (PREVIEW) | void | Sets the latency above which a LatencyOverThreshold event with the diagnostic string is created for point operations |
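To make the proposed surface easier to picture, here is a rough sketch of how the four types in the tables above could fit together. Everything in it is a stand-in written for this illustration, not the SDK's code: the namespace `ProposalSketch`, the read-only properties, the `IsOverThreshold` helper, the "unset threshold never fires" behaviour and the choice to return the builder from `WithClientTelemetryOptions` are all assumptions. Only the method names and parameter types come from the proposal.

```csharp
using System;

namespace ProposalSketch
{
    // Stand-in for the proposed CosmosThresholdOptions (not the SDK type).
    public sealed class CosmosThresholdOptions
    {
        private TimeSpan? pointLatency;     // null means "not configured" (assumption)
        private TimeSpan? nonPointLatency;

        public void PointOperationLatencyThreshold(TimeSpan span) => pointLatency = span;
        public void NonPointOperationLatencyThreshold(TimeSpan span) => nonPointLatency = span;

        // Hypothetical helper: would a LatencyOverThreshold event be raised?
        public bool IsOverThreshold(bool isPointOperation, TimeSpan latency)
        {
            TimeSpan? limit = isPointOperation ? pointLatency : nonPointLatency;
            return limit.HasValue && latency > limit.Value;
        }
    }

    // Stand-in for the proposed CosmosClientTelemetryOptions.
    public sealed class CosmosClientTelemetryOptions
    {
        // Both features start enabled, as the proposal states.
        public bool SendMetricsToService { get; private set; } = true;
        public bool DistributedTracing { get; private set; } = true;
        public CosmosThresholdOptions Thresholds { get; private set; } = new CosmosThresholdOptions();

        public void DisableSendingMetricsToService() => SendMetricsToService = false;
        public void DisableDistributedTracing() => DistributedTracing = false;
        public void CosmosThresholdOptions(CosmosThresholdOptions options) => Thresholds = options;
    }

    // Stand-in for the builder entry point; returning the builder is an assumption.
    public sealed class CosmosClientBuilder
    {
        public CosmosClientTelemetryOptions TelemetryOptions { get; private set; } = new CosmosClientTelemetryOptions();

        public CosmosClientBuilder WithClientTelemetryOptions(CosmosClientTelemetryOptions options)
        {
            TelemetryOptions = options;
            return this;
        }
    }

    public static class Usage
    {
        // Disables service-bound metrics, keeps tracing, and sets both latency thresholds.
        public static CosmosClientBuilder Configure()
        {
            var thresholds = new CosmosThresholdOptions();
            thresholds.PointOperationLatencyThreshold(TimeSpan.FromMilliseconds(50));
            thresholds.NonPointOperationLatencyThreshold(TimeSpan.FromSeconds(1));

            var telemetry = new CosmosClientTelemetryOptions();
            telemetry.DisableSendingMetricsToService();
            telemetry.CosmosThresholdOptions(thresholds);

            return new CosmosClientBuilder().WithClientTelemetryOptions(telemetry);
        }
    }
}
```

With the values chosen in `Configure()`, an 80 ms point read would exceed its threshold while an 80 ms query would not, which is the distinction the two threshold methods are meant to express.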

Issue Analytics

  • State: open
  • Created 2 months ago
  • Comments: 13 (13 by maintainers)

Top GitHub Comments

2 reactions
FabianMeiswinkel commented, Jul 13, 2023

Please check whether you really want to go down the road of putting the latency thresholds behind a class named explicitly for tracing only.

In Java, DiagnosticThresholds have more meaning than just tracing, and I think that has proven to be very useful:

  • used even to decide whether to emit request-level metrics: metrics can have dimensions based on replicaId, so being able to restrict when these metrics are emitted avoids overloading the metric system with metrics that have far too high cardinality on dimensions
  • used for distributed tracing
  • and used for logging (probably not relevant for .NET)

So, the main ask is to check whether you foresee any ask to also use the same thresholds for metrics, and of course whether .NET needs more granularity on thresholds than just latency (Java allows customizing whether to consider a threshold violation based on StatusCode+SubStatusCode, RU usage, payload size and latency).

2 reactions
jcocchi commented, Jul 13, 2023

I recommend DisableSendingMetricsToService() instead of DisableClientTelemetryToService() to avoid overloading “Client Telemetry”. This also aligns better with the proposed feature name of “Send Client Diagnostic Metrics To Service”
