Help with troubleshooting K6 timeout errors

I am having some issues troubleshooting some failed requests generated by one of our K6 scripts. The script is making a simple call to one of our Web APIs and consistently reports a failure rate of approx 1% when run at a constant arrival rate of 50/s. However the IIS logs (Windows Azure App Service) show no failures at all. The K6 failures return error codes of either 1211 or 1050.

Using the following code:-

  if (
    !check(translatorResponse, {
      'is status 200': (r) => { if (r.status !== 200) { console.log("Translator Status Code: " + r.status +  " k6 Error : " + r.error  + " k6 Error code: " + r.error_code + "after a duration of: " + r.timings.duration + "made up of: Blocked " + r.timings.blocked +  " Connecting " + r.timings.connecting + " Sending " + r.timings.sending + " Waiting: " + r.timings.waiting + " Receiving: " + r.timings.receiving); } return r.status === 200 },
    })
  ) {
    fail('status code was *not* 200');
  }

The output from a failed request is either:-

"Translator Status Code: 0 k6 Error : dial: i/o timeout k6 Error code: 1211after a duration of: 0made up of: Blocked 0 Connecting 0 Sending 0 Waiting: 0 Receiving: 0"

or something like:-

Translator Status Code: 0 k6 Error : request timeout k6 Error code: 1050after a duration of: 146091.375852made up of: Blocked 0.0003 Connecting 0 Sending 0.088702 Waiting: 146091.28715 Receiving: 0

I’ve set K6 timeout to 360s:

timeout:'360s'

The VM from which these tests are run is a Standard D4s v3 (4 vcpus, 16 GiB memory) running Ubuntu 20.04. The K6 tests seem to put very little load on the VM. I see a similar failure rate (with slightly different errors) when I run locally from my Windows laptop.

What is specifically confusing me is the absence of these errors in our App logs. This is leading me to believe that the issue is local to the K6 VM, but it seems like the VM is a long way from hitting any capacity thresholds. Naturally, we cannot have 1% of our requests failing under this type of load, so I need to find the underlying cause of these failures.

The only unusual aspect of our API calls is that the URL is dynamic and fairly long. Here’s an example:


https://URL_of_my_API.net/codes/latest?codes=["404179005","10879111000001105","19856811000001107","408027002","134460003","18474811000001106","10864711000001103","34025811000001103","36089811000001102","324712008","28934111000001103","35932311000001101","9608711000001102","3285011000001106","674711000001109","414606006","38679211000001109","10280311000001102","11379211000001109","12776511000001102","454911000001105","320006003","9432811000001101","23993211000001101","7887511000001107","30757911000001104","3152611000001100","35845111000001100","330079002","38335711000001105","321502002","16439011000001109","32409611000001106","17833111000001103","919911000001102","35566911000001104","16040911000001105","736911000001106","16089611000001103","4053411000001103","22104811000001106","35285711000001104","34026211000001105","33045411000001100","318860005","22752611000001101","6065411000001102","322236009","39873111000001106","30978611000001100","3183811000001101","22403511000001106","330721008","399911000001106","31278911000001104","30906411000001109","17299911000001102","318886000","4054111000001105","38354011000001108","9730611000001106","4525911000001109","100011000001108","330239000","5326911000001108","475111000001109","318007007","865711000001107","868411000001106","3710411000001100","16439411000001100","28090611000001101","25678911000001101","23316911000001108","8624811000001108","318969005","39197111000001103","325918007","31517311000001104","331661006","726611000001102","20153411000001106","14608311000001101","517511000001108","35338111000001109","19673911000001103","4615511000001102","28891211000001108","11569911000001106","29903311000001108","10332811000001108","571111000001101","17836811000001101","34977011000001108","35918611000001100","9847011000001101","18088711000001107","15242911000001102","14029111000001109","18232911000001106","319992003","24269211000001101","17665611000001109","27964311000001101","322708003","34203111000001107","27968111000001109","539211000001105","4621211000001109","33710311000001109","18718411000001100","18441211000001109","9310711000001107","38072211000001109","327104000","32762811000001106","20962211000001106","23323611000001100","28893811000001108","423111000001109","730611000001102","321816001","21236111000001100","5638311000001102","11524111000001101","19256211000001100","29909311000001106","18000411000001103","317541008","10613311000001109","243111000001108","21652711000001104","28958211000001104","407803003","22659911000001108","324570005","8031811000001102","19673011000001102","21240111000001104","801811000001101","30002711000001108","11619611000001108","3825411000001104","36595511000001104","10618211000001109","37931711000001108","35637211000001108","4711811000001109","35386211000001108","3654511000001105","34715811000001109","21716511000001105","3145811000001106","61711000001106","32196911000001105","19262111000001103","10236911000001107","376988009","3659911000001102","376807001","4427111000001108","48811000001103","3185211000001103","35847411000001104","30911611000001102","3881611000001107","20937911000001108","35846611000001103","35302711000001101","26655011000001108","24591111000001100","13532911000001104","8102611000001105","39115411000001101","16438111000001103","23632711000001107","29024211000001100","680511000001100","415557006","921211000001104","703907006","408597001","10863911000001103","29967711000001103","3112411000001105","330900004","377269004","21009511000001100","28784211000001107","324729006","37975711000001107","9354411000001100","327711000001105","11570111000001106","408036003","33635411000001108","31345911000001101","319941006","402311000001106","34898111000001108"]

The query string is generated within K6, pulling from txt files on the local file system.

Any support of guidance would be very much appreciated

Hi @NeilVingoe!

Welcome to the community forums! :wave:

It seems like in both cases, requests simply don’t reach the server. That’s why you see nothing in the IIS logs.

The code 1211 happens when the timeout for the dial was reached (DNS errors), and 1050 is when HTTP request has timed out. The slightly different errors on windows are because dial timeout doesn’t get returned there. So it seems like a network issue :thinking:

Is your API under load balancer? Is there any chance to run a test directly by IP?

Hi there, thanks so much for the response.
The application is running on an Azure App Service Plan currently scaled out to three instances, so there will be a LB in the backend managing traffic, but that’s not something that you can get at with an App Service. I agree that we seem to be dealing with 2 separate types of errors, but both are equally confusing to me. There’s nothing in the docs that I can see that suggests that 1211 is related to DNS. Could you share a link to this info? I note that 1050 is an HTTP timeout, how does that differ from the 1400 to 1499 error range? If these were HTTP timeouts, I would expect to see corresponding 408 response codes in the IIS logs, which I am not.

Sorry, my bad :man_facepalming:, for sure 1211 is from the range of TCP errors.

I note that 1050 is an HTTP timeout, how does that differ from the 1400 to 1499 error range?

The main difference here is that 1050 is happening on the client side. The k6 stops the outgoing HTTP request. But there is no guarantee that the server received it. That’s why no logs in the IIS.

1400-1499 errors are in the range of the errors that happened when k6 and the server established the connection. For example, if request something does not exist, we should see the translation of the 404 status, like:

INFO[0005] Translator Status Code: 404 k6 Error :  k6 Error code: 1404after a duration of: 0.825899made up of: Blocked 0.00712 Connecting 0 Sending 0.032801 Waiting: 0.714006 Receiving: 0.079092  source=console

408 Request Timeout also can happen only when a connection between k6 and server was established.

I’m not familiar with the Windows Azure platform, but is there any way to monitor the TCP connections? A quick googling bring me that link what if there are some limits?

Thanks very much for the additional clarification. If both error codes are client-side, further evidenced by the fact that there are no corresponding errors in the IIS logs, and the K6 docs indicate that a stand-alone K6 install can generate significantly more requests than our tests are generating, then it feels like the issue must be Network related. I’ve reached out to Azure Support for further guidance. I’ve looked at TCP connections and we are nowhere near the limits described in the Msft docs,

Thanks again for your help

Hi guys, I thought I would loop back and update this tread on how we eventually concluded this issue, as I suspect others may find this valuable. With help from Azure Support we discovered that our problem was SNAT Port Exhaustion. Apparently Azure VMs WITHOUT a Public IP have a max of 1,024 ports allocated. This is the limit we were hitting on our tests, although Msft do not expose any metrics to allow you to monitor this(!). If however you add a public IP to your VM, this increases the port limit to around 64k, which of course was plenty! As soon as I added a Public IP, our K6 tests returned zero error, even when I doubled arrival rate. When I disassociated the public IP from the VM, the errors returned. Case closed as far as I’m concerned.

Useful Microsoft docs here:

4 Likes

@NeilVingoe thanks a lot for the returning with explanations :bowing_man:

1 Like

No problem. It took a LOT of digging!

when we try to get a response from the server the script k6 will fixedly wait 60 sec. This cannot be changed yet.
The setupTimeout function has a maximum wait time of 4 minutes.
Waiting for a function to complete is not the same as waiting for a response from the server.

This may help you,
Rachel Gomez

Hi @RachelGomez161999,

when we try to get a response from the server the script k6 will fixedly wait 60 sec. This cannot be changed yet.

It can as explained in this thread. Whether it is a good idea for an API call to take over a minute is a different question ;).

Hope this helps you!

p.s. Please if you have more question, do open a new thread with your particular question instead of ressurecting old and alredy solved ones. You can reference others threads if you have questions around their solutions.