Unexplained latency when serving pages from RAM vs disk
Is there any logical explanation to the following observations:
I have three sets of content.
First set is served out of a tmpfs in RAM by nginx.
Second set is served off the disk by php (via the same nginx instance), and
Third set is served out of mysql via the same PHP and the same nginx.
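For concreteness, the three serving paths roughly correspond to an nginx setup like this (a sketch only; the paths, addresses, and location names are placeholders, not the actual config):

```nginx
# Set 1: files served straight from a tmpfs mount
location /ram/ {
    root /mnt/ramdisk;               # tmpfs mounted here (placeholder path)
}

# Sets 2 and 3: requests handed to PHP over FastCGI;
# set 3's scripts additionally query MySQL
location ~ \.php$ {
    include fastcgi_params;
    fastcgi_param SCRIPT_FILENAME $document_root$fastcgi_script_name;
    fastcgi_pass 10.0.0.2:9000;      # backend FCGI box (placeholder address)
}
```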
As expected, the first set takes 110 ms to download.
The second set takes 120 ms, and
the third set of pages takes the longest to download at 220 ms.
However, the first set sometimes (say once in 6 requests or so) shows download times of between 500 and 1,200 ms, while the disk and DB pages are always consistent in their download times.
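To see whether those spikes are network-side or server-side, it may help to split the total time with curl's timing variables (the URL below is a placeholder; try one page from each set):

```shell
# If only time_starttransfer jumps during a spike, the delay is on the
# server; if time_connect jumps too, suspect the network path.
curl -o /dev/null -s \
  -w 'connect=%{time_connect}s ttfb=%{time_starttransfer}s total=%{time_total}s\n' \
  http://example.com/some-ram-page.html
```

Running it in a loop of 20 or so requests against each set should make the once-in-six pattern visible in the numbers.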
The RAM set comprises about 7,800 individual php files, totaling 650 MB.
The machine has 2 GB of RAM, and usually 1 GB is free and available, even after the 650 MB is used for tmpfs. Swap is not being used.
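A quick way to confirm those figures while the spikes happen (assuming a Linux box; the tmpfs mount point is a placeholder):

```shell
free -m               # the Swap row's "used" column should stay 0
df -h /mnt/ramdisk    # tmpfs size vs. the ~650 MB actually in use
vmstat 1 5            # si/so (swap-in/out) should stay 0 under load
```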
What could be the reason? Can retrieving from the RAM really take more time than retrieving from the disk? Or is it something related to the high number of tmpfs files in the RAM?
Comments
Is swap enabled or disabled on the machine? Are you sure there's no swapping/paging?
No, it's not swapping. The RAM pages take up about 650 MB of RAM, the rest of the system takes up about 400 MB, and the remaining 950 MB is used by the system for caching etc., and is therefore available.
Are you downloading the content from the machine itself?
Otherwise, it may be a network issue.
Is swap disabled completely or just not being used? In the latter case try to disable it completely.
No. From another server in another city (1000 miles apart).
Initially, I considered the network possibility. But that should impact the disk/DB pages as well, right? Those are always consistent. Could it be that because RAM page requests are responded to almost immediately, some other kind of network latency/behavior is at play? Frankly, I tried to recreate the problem from another machine on the same LAN, but wasn't able to.
The machine doesn't have any swap (htop shows 0K). I presume I don't need to disable it further.
Is this a VPS (if so what type of virtualization is being used)?
It's a T2 EC2 instance from AWS.
So, if I understand correctly: if you get consistent latency from the same LAN, it's a network problem, I guess?
It's probably the way the instance is dealing with the network, in the sense that only the RAM pages are affected, while the others are not.
The RAM pages are served by the front-end server, while the other pages are proxied to a backend FCGI server (also an EC2 instance).
If it was purely a network issue, there's no reason why only the pages served directly by the frontend proxy are affected and the proxied pages work smoothly.
Has the frontend proxy been running on the same physical machine this whole time?
Did you try moving the frontend proxy to a different physical machine (by stopping it then starting it)?
I think you just unwittingly described the cause yourself. tmpfs doesn't get cached by the OS, plus the front server seems to be overwhelmed. Keep in mind that the FCGI server delivers nicely pre-digested content packets that the front server just needs to push out. In your first case, though, that front server additionally has to do all the things that the FCGI backends do in the other cases. Also, the FCGI backend has multiple cache levels, like the OS and itself.
It might be worth a test to try two things: how does it work out when you serve disk-based content instead of tmpfs? And how does it change things if you simply tar the whole content in scenario one?
Could it really be getting overwhelmed? There are only around 100 active connections at any time. CPU usage is in the low single digits, and RAM usage is only 50% (the other 50% is used by the OS for keeping pages/caching).
Also, tmpfs doesn't need to be cached, right? It's already in RAM, so it's as good as cached? Or do OS-level caches work faster than tmpfs in terms of retrieval?
I was trying to make it easy for the frontend server by putting stuff in tmpfs. I thought it just had to locate the stuff and push it out, given that these are pre-gzipped php files.
I have limitations in serving it off the disk, because then I'll have to go for the costly io1 type of disk. (The site gets as much as a million PVs a month; add to that the CSS and JS files. Images are CDNed.)
On tarring the whole content: it's already in gzipped state, and nginx has been instructed to send the pre-zipped versions instead of zipping again. And it works, too. Isn't that enough?
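For what it's worth, the pre-gzipped setup described here usually looks like this in nginx, via the ngx_http_gzip_static_module (a sketch; the path is a placeholder):

```nginx
location /ram/ {
    root /mnt/ramdisk;
    gzip_static on;   # serve foo.html.gz directly when the client accepts gzip
    gzip off;         # never re-compress on the fly
}
```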
Do you think moving the tmpfs files to the FCGI machine's RAM and serving it out of there using an HTTP proxy on the frontend machine will help?
Yes
I can try that. Just stopping and restarting will change the underlying RAM/CPU?
"In most cases"
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/Stop_Start.html#instance_stop
I'm guessing the longer you wait between stop and start the more likely it is to be migrated.
Btw, this is a phenomenon that only affects my AWS deployments. When I deploy the same thing on Linode or DO or Softsys, this doesn't happen.
It's only on a single deployment, right? I'm just wondering if that node has a "noisy neighbor".
I actually did change the host once. Because the front end server was initially deployed on a 1 GB node, and then it ran out of RAM, so we took a snapshot and created a new 2 GB instance.
The problem of unexplained delays was present in both cases. Right now, my best guess is that when the server responds very quickly (almost instantly) with the content, somehow something is triggered that causes a delay or congestion. Interestingly, it is also visible only when I test from the remote location, so this congestion could be related to the remote end as well. I'll test again from another remote location after spinning up a server somewhere and see.
If the problem is only with AWS, why not just skip that provider and use whatever works flawlessly for you, like you said: Linode, DO, Softsys, etc.
Yes, tmpfs means it's already in memory, but it also means it's behind a file system and some OS layers, unlike, say, a MySQL cache, which is under the application's control.
The tests I suggested were aiming to get a clearer picture. As you don't experience the problem with other hosters, it's highly likely due to something AWS-specific, most probably either in their virtualization or in the kernel used. My suggestions were meant as the easy way, but you can also use strace directly. I don't know, though, whether Amazon's VMs support that. If they do, you can filter for the relevant calls and quickly find where the problem is.
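If strace is available on the instance, the filtered approach could look like this (the worker PID is a placeholder you'd look up with ps):

```shell
# -T prints the time spent inside each syscall; -f follows worker processes.
# Long gaps in openat/read/write on the tmpfs files would point at the FS path;
# long gaps in the network calls would point at the virtual NIC.
strace -f -T -e trace=network,openat,read,write \
       -p "$NGINX_WORKER_PID" -o /tmp/nginx.strace

# Or just get a per-syscall time summary:
strace -f -c -p "$NGINX_WORKER_PID"
```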
strace is possible. But I figured out that the problem was with the machine that was making the calls, not at Amazon's end. Phew!
I do. I just wanted to try AWS to be closer to the audience, and their E5-2676 CPUs are far more reliable, and about 30% more powerful, than what DO offers in India.
What was the problem?
That's probably why. Are you using T2 standard or T2 Unlimited? AWS EC2 T2 instances are burstable, so performance can vary and is not consistent: https://aws.amazon.com/ec2/instance-types/#burst. Try using a non-T2 EC2 instance instead.
The T2 baseline is 1/5 of one CPU core for a t2.small. For more info see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/t2-credits-baseline-concepts.html
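The 1/5 figure follows directly from the credit math in that doc: one CPU credit is one vCPU running at 100% for one minute, and a t2.small earns 12 credits per hour (published t2.small numbers; a sketch of the arithmetic):

```python
# One CPU credit = one vCPU at 100% for one minute (AWS definition).
earn_rate = 12             # credits earned per hour on a t2.small
baseline = earn_rate / 60  # fraction of one core sustainable indefinitely
print(baseline)            # 0.2, i.e. 1/5 of one CPU core
# Once the credit balance is exhausted, the instance is throttled to this
# baseline, which would explain intermittent slow responses under load.
```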