Unexpected timeout from process
Hi,
We have dotnet-monitor set up on ECS Fargate, running in Listen mode and collecting metrics every X. Our setup is a single dotnet-monitor sidecar inside each launched task, with many tasks being launched. On some tasks it stops working after a few hours with the following error:
{
  "Timestamp": "2022-04-15T06:50:03.0758604Z",
  "EventId": 52,
  "LogLevel": "Warning",
  "Category": "Microsoft.Diagnostics.Tools.Monitor.ServerEndpointInfoSource",
  "Message": "Unexpected timeout from process 6. Process will no longer be monitored.",
  "State": {
    "Message": "Unexpected timeout from process 6. Process will no longer be monitored.",
    "processId": "6",
    "{OriginalFormat}": "Unexpected timeout from process {processId}. Process will no longer be monitored."
  },
  "Scopes": []
}
Then all subsequent requests get this error:
{
  "Timestamp": "2022-04-15T06:55:01.6363199Z",
  "EventId": 1,
  "LogLevel": "Error",
  "Category": "Microsoft.Diagnostics.Monitoring.WebApi.Controllers.DiagController",
  "Message": "Request failed.",
  "Exception": "System.ArgumentException: Unable to discover a target process. at Microsoft.Diagnostics.Monitoring.WebApi.DiagnosticServices.GetProcessAsync(DiagProcessFilter processFilterConfig, CancellationToken token) in /_/src/Microsoft.Diagnostics.Monitoring.WebApi/DiagnosticServices.cs:line 100 at Microsoft.Diagnostics.Monitoring.WebApi.Controllers.DiagController.<>c__DisplayClass33_0`1.<<InvokeForProcess>b__0>d.MoveNext() in /_/src/Microsoft.Diagnostics.Monitoring.WebApi/Controllers/DiagController.cs:line 713 --- End of stack trace from previous location --- at Microsoft.Diagnostics.Monitoring.WebApi.Controllers.DiagControllerExtensions.InvokeService[T](ControllerBase controller, Func`1 serviceCall, ILogger logger) in /_/src/Microsoft.Diagnostics.Monitoring.WebApi/Controllers/DiagControllerExtensions.cs:line 91",
  "State": {
    "Message": "Request failed.",
    "{OriginalFormat}": "Request failed."
  },
  "Scopes": [
    {
      "Message": "SpanId:5f73f4ec6a4c2a06, TraceId:6e3bec22534dca3eed9ae13c8150dc0c, ParentId:0d6726492bd0e999",
      "SpanId": "5f73f4ec6a4c2a06",
      "TraceId": "6e3bec22534dca3eed9ae13c8150dc0c",
      "ParentId": "0d6726492bd0e999"
    },
    {
      "Message": "ConnectionId:0HMGU731FOFDF",
      "ConnectionId": "0HMGU731FOFDF"
    },
    {
      "Message": "RequestPath:/livemetrics RequestId:0HMGU731FOFDF:00000002",
      "RequestId": "0HMGU731FOFDF:00000002",
      "RequestPath": "/livemetrics"
    },
    {
      "Message": "Microsoft.Diagnostics.Monitoring.WebApi.Controllers.DiagController.CaptureMetrics (Microsoft.Diagnostics.Monitoring.WebApi)",
      "ActionId": "cc79e4d4-794e-481f-8083-fb3f3c7b5ca5",
      "ActionName": "Microsoft.Diagnostics.Monitoring.WebApi.Controllers.DiagController.CaptureMetrics (Microsoft.Diagnostics.Monitoring.WebApi)"
    },
    {
      "Message": "ArtifactType:livemetrics",
      "ArtifactType": "livemetrics"
    }
  ]
}
Note that the main container itself keeps working just fine and processes requests without any issues. Based on metrics captured before the error, I do not see any abnormal memory/CPU/etc. usage compared to the other tasks where dotnet-monitor keeps working.
Here is our ECS task definition (the dotnet-monitor config values are under 'Environment'):
TaskDefinition:
  Type: AWS::ECS::TaskDefinition
  Properties:
    Cpu: !Ref TaskCpu
    Memory: !Ref TaskMemory
    NetworkMode: awsvpc
    ExecutionRoleArn: !Sub "arn:aws:iam::${AWS::AccountId}:role/ecsTaskExecutionRole"
    TaskRoleArn: !ImportValue AppServicesEcsTaskRoleArn
    RequiresCompatibilities:
      - FARGATE
    Volumes:
      - Name: tmp
    ContainerDefinitions:
      - Essential: true
        Name: appservices
        Image:
          !Sub
            - "${repository}:${image}"
            - repository: !ImportValue AppServicesEcrRepository
              image: !Ref TaskEcrImageTag
        Ulimits:
          - Name: nofile
            HardLimit: 65535
            SoftLimit: 65535
        PortMappings:
          - ContainerPort: 44392
            Protocol: tcp
        LogConfiguration:
          LogDriver: awslogs
          Options:
            awslogs-group: !ImportValue AppServicesEcsLogGroup
            awslogs-region: !Ref AWS::Region
            awslogs-stream-prefix: !Ref EnvironmentName
        LinuxParameters:
          InitProcessEnabled: true
          Capabilities:
            Add:
              - SYS_PTRACE
        StopTimeout: 120
        MountPoints:
          - ContainerPath: /tmp
            SourceVolume: tmp
        Environment:
          - Name: DOTNET_DiagnosticPorts
            Value: /tmp/port
        DependsOn:
          - ContainerName: dotnet-monitor
            Condition: START
      - Essential: true
        Name: dotnet-monitor
        Image:
          !Sub
            - "${repository}:${image}-dotnetmonitor"
            - repository: !ImportValue AppServicesEcrRepository
              image: !Ref TaskEcrImageTag
        MountPoints:
          - ContainerPath: /tmp
            SourceVolume: tmp
        Environment:
          - Name: Kestrel__Certificates__Default__Path
            Value: /tmp/cert.pfx
          - Name: DotnetMonitor_S3Bucket
            Value: !Sub '{{resolve:ssm:/appservices/${EnvironmentName}/integration.bulk.s3.bucket:1}}'
          - Name: DotnetMonitor_DefaultProcess__Filters__0__Key
            Value: ProcessName
          - Name: DotnetMonitor_DefaultProcess__Filters__0__Value
            Value: dotnet
          - Name: DotnetMonitor_DiagnosticPort__ConnectionMode
            Value: Listen
          - Name: DotnetMonitor_DiagnosticPort__EndpointName
            Value: /tmp/port
          - Name: DotnetMonitor_Storage__DumpTempFolder
            Value: /tmp
          - Name: DotnetMonitor_Egress__FileSystem__file__directoryPath
            Value: /tmp/gcdump
          - Name: DotnetMonitor_Egress__FileSystem__file__intermediateDirectoryPath
            Value: /tmp/gcdumptmp
          - Name: DotnetMonitor_CollectionRules__HighMemoryRule__Trigger__Type
            Value: EventCounter
          - Name: DotnetMonitor_CollectionRules__HighMemoryRule__Trigger__Settings__ProviderName
            Value: System.Runtime
          - Name: DotnetMonitor_CollectionRules__HighMemoryRule__Trigger__Settings__CounterName
            Value: working-set
          - Name: DotnetMonitor_CollectionRules__HighMemoryRule__Trigger__Settings__GreaterThan
            Value: !Ref TaskMemoryAutoGCDump
          - Name: DotnetMonitor_CollectionRules__HighMemoryRule__Trigger__Settings__SlidingWindowDuration
            Value: 00:00:05
          - Name: DotnetMonitor_CollectionRules__HighMemoryRule__Actions__0__Type
            Value: CollectGCDump
          - Name: DotnetMonitor_CollectionRules__HighMemoryRule__Actions__0__Name
            Value: GCDump
          - Name: DotnetMonitor_CollectionRules__HighMemoryRule__Actions__0__Settings__Egress
            Value: file
          - Name: DotnetMonitor_CollectionRules__HighMemoryRule__Actions__1__Type
            Value: Execute
          - Name: DotnetMonitor_CollectionRules__HighMemoryRule__Actions__1__Settings__Path
            Value: /bin/sh
          - Name: DotnetMonitor_CollectionRules__HighMemoryRule__Actions__1__Settings__Arguments
            Value: /app/gcdump.sh $(Actions.GCDump.EgressPath)
          - Name: DotnetMonitor_CollectionRules__HighMemoryRule__Limits__ActionCount
            Value: 1
          - Name: DotnetMonitor_CollectionRules__HighMemoryRule__Limits__ActionCountSlidingWindowDuration
            Value: 03:00:00
        Secrets:
          - Name: DotnetMonitor_Authentication__MonitorApiKey__Subject
            ValueFrom: !Sub "arn:aws:ssm:${AWS::Region}:${AWS::AccountId}:parameter/appservices/${EnvironmentName}/dotnetmonitor.subject"
          - Name: DotnetMonitor_Authentication__MonitorApiKey__PublicKey
            ValueFrom: !Sub "arn:aws:ssm:${AWS::Region}:${AWS::AccountId}:parameter/appservices/${EnvironmentName}/dotnetmonitor.publickey"
        LogConfiguration:
          LogDriver: awslogs
          Options:
            awslogs-group: !ImportValue AppServicesEcsLogGroup
            awslogs-region: !Ref AWS::Region
            awslogs-stream-prefix: !Ref EnvironmentName
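For context on how this wiring is verified: dotnet-monitor maps the double-underscore environment variables above onto its hierarchical configuration, and in Listen mode it is the monitor side that creates the Unix domain socket at /tmp/port for the app to connect to. Below is a rough sketch of how one might sanity-check a running sidecar; it assumes ECS Exec is enabled on the task, that the `config show` subcommand is available in the monitor:6 image, and uses placeholder cluster/task IDs.
# From a workstation: open a shell inside the dotnet-monitor sidecar.
# Requires enableExecuteCommand on the service/task (assumption).
aws ecs execute-command \
  --cluster <cluster-name> \
  --task <task-id> \
  --container dotnet-monitor \
  --interactive \
  --command "/bin/sh"

# Inside the container: the diagnostic port should show up as a Unix domain
# socket in the shared /tmp volume once dotnet-monitor is listening.
ls -l /tmp/port

# Print the configuration dotnet-monitor resolved from the DotnetMonitor_*
# environment variables (subcommand per the dotnet-monitor documentation).
dotnet-monitor config show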
And the Dockerfile used to customize the default dotnet-monitor container:
FROM mcr.microsoft.com/dotnet/monitor:6
RUN apk add curl && \
    apk add jq && \
    apk add aws-cli && \
    apk add dos2unix
RUN adduser -s /bin/true -u 1000 -D -h /app app \
    && chown -R "app" "/app"
COPY --chown=app:app --chmod=500 gcdump.sh /app/gcdump.sh
RUN dos2unix /app/gcdump.sh
USER app
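The gcdump.sh referenced by the Execute action isn't shown here. Given that the image installs aws-cli and the task sets DotnetMonitor_S3Bucket, a plausible sketch of such a script is below; the key prefix and cleanup step are illustrative only, not the exact script in use.
#!/bin/sh
# Illustrative gcdump.sh: receives the path produced by the CollectGCDump action
# ($(Actions.GCDump.EgressPath) is passed as the first argument) and copies it to S3.
set -eu

EGRESS_PATH="$1"                                   # e.g. a file under /tmp/gcdump
BUCKET="${DotnetMonitor_S3Bucket:?S3 bucket not configured}"

# Key prefix "gcdumps/" is an assumption for illustration.
aws s3 cp "${EGRESS_PATH}" "s3://${BUCKET}/gcdumps/$(basename "${EGRESS_PATH}")"

# Free space on the shared /tmp volume once the upload succeeds.
rm -f "${EGRESS_PATH}"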
@xsoheilalizadeh
I also had exactly the same problem: System.IO.EndOfStreamException: Attempted to read past the end of the stream.
/gcdump and /trace worked perfectly; only with /dump did the same error occur.
It turned out that my resources for the target container were too tight, and the pod crashed immediately when I invoked the /dump endpoint (maybe some kind of out-of-memory).
After I removed the resource requests/limits from the YAML, everything worked as desired 😃
That doesn't prove very much in this case: the dotnet-trace tool makes at most three diagnostic connections to the target process (one to start the event session, one to optionally resume the runtime, and one to end the event session). The dotnet-monitor tool makes connections to the target application every three seconds to determine liveliness, plus whatever operations you perform (listing processes, capturing artifacts, etc.). If the liveliness check fails, then dotnet-monitor gives up on monitoring the process (because if the diagnostic channel doesn't respond to a liveliness check, it isn't going to respond to a more complicated operation).
Previous reports suggest that the target application is no longer monitored after about 30 minutes; that would be 600 liveliness probes (30 minutes / 3 seconds per probe) before the failure to respond. That's a lot more socket connections than what dotnet-trace does.
My current theory is that the target runtime doesn't always respond correctly when a diagnostic connection is terminated without a command (e.g. the liveliness probe), but I haven't been able to reproduce the issue yet.
If anyone can reproduce the problem and capture a dump of both dotnet-monitor and the target process, that would be very helpful.
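One way to catch the moment monitoring stops (so the dumps can be captured right away) is to watch the /processes endpoint and note when the target PID disappears from the list. A rough sketch, assuming the default https://localhost:52323 binding and a bearer token issued for the configured MonitorApiKey (both are assumptions about this particular setup):
# Illustrative repro helper: poll /processes until the target process disappears,
# i.e. right after the "Unexpected timeout" warning is logged.
TOKEN="<bearer token for the configured MonitorApiKey>"   # placeholder
URL="https://localhost:52323"                             # assumed default binding

while true; do
  date -u
  # -k skips TLS validation for the self-issued /tmp/cert.pfx certificate.
  curl -sk -H "Authorization: Bearer ${TOKEN}" "${URL}/processes" | jq .
  sleep 30
done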