Unexplained latency when serving pages from RAM vs disk
Is there any logical explanation to the following observations:
I have three sets of content.
First set is served out of a tmpfs in RAM by nginx.
Second set is served off the disk by php (via the same nginx instance), and
Third set is served out of mysql via the same PHP and the same nginx.
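For concreteness, the three serving paths roughly correspond to an nginx setup like this (a sketch only; the paths, addresses, and location names are placeholders, not the actual config):

```nginx
# Set 1: files served straight from a tmpfs mount
location /ram/ {
    root /mnt/ramdisk;               # tmpfs mounted here (placeholder path)
}

# Sets 2 and 3: requests handed to PHP over FastCGI;
# set 3's scripts additionally query MySQL
location ~ \.php$ {
    include fastcgi_params;
    fastcgi_param SCRIPT_FILENAME $document_root$fastcgi_script_name;
    fastcgi_pass 10.0.0.2:9000;      # backend FCGI box (placeholder address)
}
```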
As expected, the first set takes 110 ms to download.
The second set takes 120 ms, and
the third set of pages takes the longest to download at 220 ms.
However, the first set sometimes (say once in 6 requests or so) shows download times of between 500 and 1,200 ms, while the disk and DB pages are always consistent in their download times.
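To see whether those spikes are network-side or server-side, it may help to split the total time with curl's timing variables (the URL below is a placeholder; try one page from each set):

```shell
# If only time_starttransfer jumps during a spike, the delay is on the
# server; if time_connect jumps too, suspect the network path.
curl -o /dev/null -s \
  -w 'connect=%{time_connect}s ttfb=%{time_starttransfer}s total=%{time_total}s\n' \
  http://example.com/some-ram-page.html
```

Running it in a loop of 20 or so requests against each set should make the once-in-six pattern visible in the numbers.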
The RAM set comprises about 7,800 individual php files, totaling 650 MB.
The machine has 2 GB of RAM, and usually 1 GB is free and available, even after the 650 MB is used for tmpfs. Swap is not being used.
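A quick way to confirm those figures while the spikes happen (assuming a Linux box; the tmpfs mount point is a placeholder):

```shell
free -m               # the Swap row's "used" column should stay 0
df -h /mnt/ramdisk    # tmpfs size vs. the ~650 MB actually in use
vmstat 1 5            # si/so (swap-in/out) should stay 0 under load
```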
What could be the reason? Can retrieving from the RAM really take more time than retrieving from the disk? Or is it something related to the high number of tmpfs files in the RAM?
Comments
Is swap enabled or disabled on the machine? Are you sure there's no swapping/paging?
No, it's not swapping. The RAM pages take up about 650 MB of RAM, the rest of the system takes up about 400 MB, and the remaining 950 MB is used by the system for caching etc., and is therefore available.
Are you downloading the content from the machine itself?
Otherwise, it may be a network issue.
Is swap disabled completely or just not being used? In the latter case try to disable it completely.
No. From another server in another city (1000 miles apart).
Initially, I considered the network possibility. But that should impact the disk/DB pages as well, right? Those are always consistent. Could it be that because RAM page requests are responded to almost immediately, some other kind of network latency/behavior is at play? Frankly, I tried to recreate the problem from another machine on the same LAN, but wasn't able to.
The machine doesn't have any swap (htop shows 0K). I presume I don't need to disable it further.
Is this a VPS (if so what type of virtualization is being used)?
It's a T2 EC2 instance from AWS.
So, if I understand correctly: if you get consistent latency from the same LAN, it's a network problem, I guess?
It's probably the way the instance is dealing with the network, in the sense that only the RAM pages are affected, while the others are not.
The RAM pages are served by the front-end server, while the other pages are proxied to a backend FCGI server (also an EC2 instance).
If it was purely a network issue, there's no reason why only the pages served directly by the frontend proxy are affected and the proxied pages work smoothly.
Has the frontend proxy been running on the same physical machine this whole time?
Did you try moving the frontend proxy to a different physical machine (by stopping it then starting it)?
I think you just unwittingly described the cause yourself. tmpfs doesn't get cached by the OS, plus the front server seems to be overwhelmed. Keep in mind that the FCGI server delivers nicely pre-digested content packets that the front server just needs to push out. In your first case, though, that front server additionally has to do all the things that the FCGI backends do in the other cases. Also, the FCGI backend has multiple cache levels, like the OS and itself.
It might be worth a test to try two things: how does it work out when you serve disk-based content instead of tmpfs? And how does it change things if you simply tar the whole content in scenario one?
Could it really be getting overwhelmed? There are only around 100 active connections at any time. CPU usage is in the low single digits, and RAM usage is only 50% (the other 50% is used by the OS for keeping pages/caching).
Also, tmpfs doesn't need to be cached, right? It's already in RAM, so it's as good as cached? Or do OS-level caches work faster than tmpfs in terms of retrieval?
I was trying to make it easy for the frontend server by putting stuff in tmpfs. I thought it just had to locate the stuff and push it out, given that these are pre-gzipped php files.
I have limitations in serving it off the disk, because then I'll have to go for the costly io1 type of disk. (The site gets as much as a million PVs a month; add to that the CSS and JS files. Images are CDNed.)
On tarring the whole content: it's already in gzipped state, and nginx has been instructed to send the pre-zipped versions instead of zipping again. And it works, too. Isn't that enough?
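For what it's worth, the pre-gzipped setup described here usually looks like this in nginx, via the ngx_http_gzip_static_module (a sketch; the path is a placeholder):

```nginx
location /ram/ {
    root /mnt/ramdisk;
    gzip_static on;   # serve foo.html.gz directly when the client accepts gzip
    gzip off;         # never re-compress on the fly
}
```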
Do you think moving the tmpfs files to the FCGI machine's RAM and serving it out of there using an HTTP proxy on the frontend machine will help?
Yes
I can try that. Just stopping and restarting will change the underlying RAM/CPU?
"In most cases"
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/Stop_Start.html#instance_stop
I'm guessing the longer you wait between stop and start the more likely it is to be migrated.
Btw, this is a phenomenon that only affects my AWS deployments. When I deploy the same thing on Linode or DO or Softsys, this doesn't happen.
It's only on a single deployment, right? I'm just wondering if that node has a "noisy neighbor".
I actually did change the host once. Because the front end server was initially deployed on a 1 GB node, and then it ran out of RAM, so we took a snapshot and created a new 2 GB instance.
The problem of unexplained delays was present in both cases. Right now, my best guess is that when the server responds very quickly (almost instantly) with the content, somehow something is triggered that causes a delay or congestion. Interestingly, it is also visible only when I test from the remote location, so this congestion could be related to the remote end as well. I'll test again from another remote location after spinning up a server somewhere and see.
If the problem is only with AWS, why not just skip that provider and use whatever works flawlessly for you, like you said: Linode, DO, Softsys, etc.
Yes, tmpfs means it's already in memory, but it also means it's behind a file system and some OS layers, unlike, say, a MySQL cache, which is under the application's control.
The tests I suggested were aiming to get a clearer picture. As you don't experience the problem with other hosters, it's highly likely due to something AWS-specific, most probably either in their virtualization or in the kernel used. My suggestions were meant as the easy way, but you can also use strace directly. I don't know, though, whether Amazon's VMs support that. If they do, you can filter for the relevant calls and quickly find where the problem is.
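If strace is available on the instance, the filtered approach could look like this (the worker PID is a placeholder you'd look up with ps):

```shell
# -T prints the time spent inside each syscall; -f follows worker processes.
# Long gaps in openat/read/write on the tmpfs files would point at the FS path;
# long gaps in the network calls would point at the virtual NIC.
strace -f -T -e trace=network,openat,read,write \
       -p "$NGINX_WORKER_PID" -o /tmp/nginx.strace

# Or just get a per-syscall time summary:
strace -f -c -p "$NGINX_WORKER_PID"
```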
strace is possible. But I figured out that the problem was with the machine that was making the calls, not at Amazon's end. Phew!
I do. I just wanted to try AWS to be closer to the audience, and their E5-2676 CPUs are far more reliable, and about 30% more powerful, than what DO offers in India.
What was the problem?
That's probably why. Are you using T2 standard or T2 Unlimited? AWS EC2 T2 instances are burstable, so performance can vary and is not consistent: https://aws.amazon.com/ec2/instance-types/#burst. Try using a non-T2 EC2 instance instead.
The T2 baseline is 1/5 of one CPU core for a t2.small. For more info see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/t2-credits-baseline-concepts.html
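The 1/5 figure follows directly from the credit math in that doc: one CPU credit is one vCPU running at 100% for one minute, and a t2.small earns 12 credits per hour (published t2.small numbers; a sketch of the arithmetic):

```python
# One CPU credit = one vCPU at 100% for one minute (AWS definition).
earn_rate = 12             # credits earned per hour on a t2.small
baseline = earn_rate / 60  # fraction of one core sustainable indefinitely
print(baseline)            # 0.2, i.e. 1/5 of one CPU core
# Once the credit balance is exhausted, the instance is throttled to this
# baseline, which would explain intermittent slow responses under load.
```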