
Unexpected timeout from process

See original GitHub issue

Hi,

We have dotnet-monitor set up on ECS Fargate, running in listen mode and collecting metrics every X. Our setup is a single dotnet-monitor sidecar inside each launched task, with many tasks being launched. On some tasks it stops working after a few hours, with the following error:

{
    "Timestamp": "2022-04-15T06:50:03.0758604Z",
    "EventId": 52,
    "LogLevel": "Warning",
    "Category": "Microsoft.Diagnostics.Tools.Monitor.ServerEndpointInfoSource",
    "Message": "Unexpected timeout from process 6. Process will no longer be monitored.",
    "State": {
        "Message": "Unexpected timeout from process 6. Process will no longer be monitored.",
        "processId": "6",
        "{OriginalFormat}": "Unexpected timeout from process {processId}. Process will no longer be monitored."
    },
    "Scopes": []
}

Then all subsequent requests get this error:

{
    "Timestamp": "2022-04-15T06:55:01.6363199Z",
    "EventId": 1,
    "LogLevel": "Error",
    "Category": "Microsoft.Diagnostics.Monitoring.WebApi.Controllers.DiagController",
    "Message": "Request failed.",
    "Exception": "System.ArgumentException: Unable to discover a target process.    at Microsoft.Diagnostics.Monitoring.WebApi.DiagnosticServices.GetProcessAsync(DiagProcessFilter processFilterConfig, CancellationToken token) in /_/src/Microsoft.Diagnostics.Monitoring.WebApi/DiagnosticServices.cs:line 100    at Microsoft.Diagnostics.Monitoring.WebApi.Controllers.DiagController.<>c__DisplayClass33_0`1.<<InvokeForProcess>b__0>d.MoveNext() in /_/src/Microsoft.Diagnostics.Monitoring.WebApi/Controllers/DiagController.cs:line 713 --- End of stack trace from previous location ---    at Microsoft.Diagnostics.Monitoring.WebApi.Controllers.DiagControllerExtensions.InvokeService[T](ControllerBase controller, Func`1 serviceCall, ILogger logger) in /_/src/Microsoft.Diagnostics.Monitoring.WebApi/Controllers/DiagControllerExtensions.cs:line 91",
    "State": {
        "Message": "Request failed.",
        "{OriginalFormat}": "Request failed."
    },
    "Scopes": [
        {
            "Message": "SpanId:5f73f4ec6a4c2a06, TraceId:6e3bec22534dca3eed9ae13c8150dc0c, ParentId:0d6726492bd0e999",
            "SpanId": "5f73f4ec6a4c2a06",
            "TraceId": "6e3bec22534dca3eed9ae13c8150dc0c",
            "ParentId": "0d6726492bd0e999"
        },
        {
            "Message": "ConnectionId:0HMGU731FOFDF",
            "ConnectionId": "0HMGU731FOFDF"
        },
        {
            "Message": "RequestPath:/livemetrics RequestId:0HMGU731FOFDF:00000002",
            "RequestId": "0HMGU731FOFDF:00000002",
            "RequestPath": "/livemetrics"
        },
        {
            "Message": "Microsoft.Diagnostics.Monitoring.WebApi.Controllers.DiagController.CaptureMetrics (Microsoft.Diagnostics.Monitoring.WebApi)",
            "ActionId": "cc79e4d4-794e-481f-8083-fb3f3c7b5ca5",
            "ActionName": "Microsoft.Diagnostics.Monitoring.WebApi.Controllers.DiagController.CaptureMetrics (Microsoft.Diagnostics.Monitoring.WebApi)"
        },
        {
            "Message": "ArtifactType:livemetrics",
            "ArtifactType": "livemetrics"
        }
    ]
}

Note that the main container itself keeps working just fine and processes requests without any issues. In the metrics captured before the error I do not see any abnormal memory/CPU/etc. usage compared to the other tasks where dotnet-monitor keeps working.

Here is our ECS task definition (the dotnet-monitor config values are under ‘Environment’):

  TaskDefinition:
    Type: AWS::ECS::TaskDefinition
    Properties:
      Cpu: !Ref TaskCpu
      Memory: !Ref TaskMemory
      NetworkMode: awsvpc
      ExecutionRoleArn: !Sub "arn:aws:iam::${AWS::AccountId}:role/ecsTaskExecutionRole"
      TaskRoleArn: !ImportValue AppServicesEcsTaskRoleArn
      RequiresCompatibilities:
        - FARGATE
      Volumes:
        - Name: tmp
      ContainerDefinitions:
        - Essential: true
          Name: appservices
          Image:
            !Sub
              - "${repository}:${image}"
              - repository: !ImportValue AppServicesEcrRepository
                image: !Ref TaskEcrImageTag
          Ulimits:
            - Name: nofile
              HardLimit: 65535
              SoftLimit: 65535
          PortMappings:
            - ContainerPort: 44392
              Protocol: tcp
          LogConfiguration:
            LogDriver: awslogs
            Options:
              awslogs-group: !ImportValue AppServicesEcsLogGroup
              awslogs-region: !Ref AWS::Region
              awslogs-stream-prefix: !Ref EnvironmentName
          LinuxParameters:
            InitProcessEnabled: true
            Capabilities:
              Add:
                - SYS_PTRACE
          StopTimeout: 120
          MountPoints:
            - ContainerPath: /tmp
              SourceVolume: tmp
          Environment:
            - Name: DOTNET_DiagnosticPorts
              Value: /tmp/port
          DependsOn:
            - ContainerName: dotnet-monitor
              Condition: START
        - Essential: true
          Name: dotnet-monitor
          Image:
            !Sub
            - "${repository}:${image}-dotnetmonitor"
            - repository: !ImportValue AppServicesEcrRepository
              image: !Ref TaskEcrImageTag
          MountPoints:
            - ContainerPath: /tmp
              SourceVolume: tmp
          Environment:
            - Name: Kestrel__Certificates__Default__Path
              Value: /tmp/cert.pfx
            - Name: DotnetMonitor_S3Bucket
              Value: !Sub '{{resolve:ssm:/appservices/${EnvironmentName}/integration.bulk.s3.bucket:1}}'
            - Name: DotnetMonitor_DefaultProcess__Filters__0__Key
              Value: ProcessName
            - Name: DotnetMonitor_DefaultProcess__Filters__0__Value
              Value: dotnet
            - Name: DotnetMonitor_DiagnosticPort__ConnectionMode
              Value: Listen
            - Name: DotnetMonitor_DiagnosticPort__EndpointName
              Value: /tmp/port
            - Name: DotnetMonitor_Storage__DumpTempFolder
              Value: /tmp
            - Name: DotnetMonitor_Egress__FileSystem__file__directoryPath
              Value: /tmp/gcdump
            - Name: DotnetMonitor_Egress__FileSystem__file__intermediateDirectoryPath
              Value: /tmp/gcdumptmp
            - Name: DotnetMonitor_CollectionRules__HighMemoryRule__Trigger__Type
              Value: EventCounter
            - Name: DotnetMonitor_CollectionRules__HighMemoryRule__Trigger__Settings__ProviderName
              Value: System.Runtime
            - Name: DotnetMonitor_CollectionRules__HighMemoryRule__Trigger__Settings__CounterName
              Value: working-set
            - Name: DotnetMonitor_CollectionRules__HighMemoryRule__Trigger__Settings__GreaterThan
              Value: !Ref TaskMemoryAutoGCDump
            - Name: DotnetMonitor_CollectionRules__HighMemoryRule__Trigger__Settings__SlidingWindowDuration
              Value: 00:00:05
            - Name: DotnetMonitor_CollectionRules__HighMemoryRule__Actions__0__Type
              Value: CollectGCDump
            - Name: DotnetMonitor_CollectionRules__HighMemoryRule__Actions__0__Name
              Value: GCDump
            - Name: DotnetMonitor_CollectionRules__HighMemoryRule__Actions__0__Settings__Egress
              Value: file
            - Name: DotnetMonitor_CollectionRules__HighMemoryRule__Actions__1__Type
              Value: Execute
            - Name: DotnetMonitor_CollectionRules__HighMemoryRule__Actions__1__Settings__Path
              Value: /bin/sh
            - Name: DotnetMonitor_CollectionRules__HighMemoryRule__Actions__1__Settings__Arguments
              Value: /app/gcdump.sh $(Actions.GCDump.EgressPath)
            - Name: DotnetMonitor_CollectionRules__HighMemoryRule__Limits__ActionCount
              Value: 1
            - Name: DotnetMonitor_CollectionRules__HighMemoryRule__Limits__ActionCountSlidingWindowDuration
              Value: 03:00:00
          Secrets:
            - Name: DotnetMonitor_Authentication__MonitorApiKey__Subject
              ValueFrom: !Sub "arn:aws:ssm:${AWS::Region}:${AWS::AccountId}:parameter/appservices/${EnvironmentName}/dotnetmonitor.subject"
            - Name: DotnetMonitor_Authentication__MonitorApiKey__PublicKey
              ValueFrom: !Sub "arn:aws:ssm:${AWS::Region}:${AWS::AccountId}:parameter/appservices/${EnvironmentName}/dotnetmonitor.publickey"
          LogConfiguration:
            LogDriver: awslogs
            Options:
              awslogs-group: !ImportValue AppServicesEcsLogGroup
              awslogs-region: !Ref AWS::Region
              awslogs-stream-prefix: !Ref EnvironmentName
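
For context, the requests that show up as RequestPath:/livemetrics in the error above are ordinary authenticated HTTPS calls against the dotnet-monitor sidecar. A minimal sketch of such a call, assuming dotnet-monitor's default 52323 binding and a placeholder bearer token for the configured MonitorApiKey:

# Sketch of the kind of request that starts failing once the process is dropped.
# Host, port (52323 is dotnet-monitor's default) and the token are placeholders;
# -k skips validation of the self-signed /tmp/cert.pfx.
curl -k \
  -H "Authorization: Bearer <jwt-for-the-configured-MonitorApiKey>" \
  https://localhost:52323/livemetrics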

And the Dockerfile used to customize the default dotnet-monitor container:

FROM mcr.microsoft.com/dotnet/monitor:6

RUN apk add curl && \
    apk add jq && \
    apk add aws-cli && \
    apk add dos2unix

RUN adduser -s /bin/true -u 1000 -D -h /app app \
  && chown -R "app" "/app"

COPY --chown=app:app --chmod=500 gcdump.sh /app/gcdump.sh
RUN dos2unix /app/gcdump.sh

USER app
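
The gcdump.sh helper referenced by the Execute action is not included in the issue. Purely as an illustration of what such a script might do, given that the image installs aws-cli and the action passes $(Actions.GCDump.EgressPath) as its argument, here is a hypothetical sketch (the S3 key layout and the use of DotnetMonitor_S3Bucket are assumptions):

#!/bin/sh
# Hypothetical sketch only -- not the script from the issue.
# $1 is the egressed .gcdump path supplied via $(Actions.GCDump.EgressPath).
set -eu

DUMP_PATH="$1"
# DotnetMonitor_S3Bucket is defined in the task definition above; whether the
# real script reads it this way is an assumption.
BUCKET="${DotnetMonitor_S3Bucket:?no S3 bucket configured}"

aws s3 cp "$DUMP_PATH" "s3://${BUCKET}/gcdumps/$(hostname)-$(basename "$DUMP_PATH")"
rm -f "$DUMP_PATH"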

Issue Analytics

  • State: open
  • Created a year ago
  • Reactions: 1
  • Comments: 16 (3 by maintainers)

Top GitHub Comments

1 reaction
gest01 commented, Nov 28, 2022

@xsoheilalizadeh

I also had exactly the same problem: System.IO.EndOfStreamException: Attempted to read past the end of the stream.

/gcdump and /trace worked perfectly. Only with /dump did the same error occur.

It turned out that the resources for my target container were too tight, and the pod crashed immediately when I invoked the /dump endpoint (maybe some kind of out-of-memory kill).

After I removed the resource requests/limits from the YAML, everything worked as desired 😃

0 reactions
jander-msft commented, Dec 6, 2022

"so this has to be dotnet-monitor bug."

That doesn’t prove very much in this case:

  • The dotnet-trace tool makes at most three diagnostic connections to the target process (one to start the event session, one to optionally resume the runtime, and one to end the event session).
  • The dotnet-monitor tool makes connections to the target application every three seconds to determine liveliness, plus whatever operations you perform (listing processes, capturing artifacts, etc.). If the liveliness check fails, then dotnet-monitor gives up on monitoring the process (because if the diagnostic channel doesn’t respond to a liveliness check, it isn’t going to respond to a more complicated operation).

Previous reports suggest that the target application is no longer monitored after about 30 minutes; that would be 600 liveliness probes (30 minutes at one probe every 3 seconds) before the failure to respond. That’s a lot more socket connections than dotnet-trace makes.

My current theory is that the target runtime doesn’t always respond correctly when a diagnostic connection is terminated without a command (e.g. the liveliness probe), but I haven’t been able to reproduce the issue yet.

If anyone can reproduce the problem and capture a dump of both dotnet-monitor and the target process, that would be very helpful.
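
For anyone trying to do that, one possible approach is the dotnet-dump global tool, run from inside each container. This is only a sketch: it assumes dotnet-dump can be made available there and that diagnostics are not disabled for the dotnet-monitor process itself; the monitor PID is a placeholder.

# Sketch only: assumes dotnet-dump is available in each container and that the
# monitor process has diagnostics enabled.

# In the application container: dump the target process (PID 6 in the logs above).
dotnet-dump collect -p 6 -o /tmp/target.dmp

# In the dotnet-monitor container: dump the monitor process itself
# (replace <monitor-pid> with the actual PID, e.g. from ps).
dotnet-dump collect -p <monitor-pid> -o /tmp/dotnet-monitor.dmp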

Read more comments on GitHub >
