Unexplained latency when serving pages from RAM vs disk

elos42 Member

Is there any logical explanation to the following observations:

I have three sets of content.

The first set is served out of a tmpfs in RAM by nginx,
the second set is served off the disk by PHP (via the same nginx instance), and
the third set is served out of MySQL via the same PHP and the same nginx.

As expected, the first set takes 110 ms to download,
the second set takes 120 ms, and
the third set takes the longest to download, at 220 ms.

However, the first set sometimes (say once in 6 requests or so) shows download times of between 500 and 1,200 ms, while the disk and DB pages are always consistent in their download times.
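
Here's roughly how the download times can be measured (a minimal sketch; the URLs are placeholders, not the real site):

    # Times repeated downloads of one page from each content set and
    # prints the median and worst case, to catch the occasional spikes.
    import statistics
    import time
    import urllib.request

    URLS = {
        "tmpfs via nginx": "http://example.com/ram/page1.html",
        "disk via PHP":    "http://example.com/disk/page1.php",
        "MySQL via PHP":   "http://example.com/db/page1.php",
    }

    def sample(url, runs=30):
        times_ms = []
        for _ in range(runs):
            start = time.perf_counter()
            with urllib.request.urlopen(url) as resp:
                resp.read()
            times_ms.append((time.perf_counter() - start) * 1000)
        return times_ms

    for label, url in URLS.items():
        ms = sample(url)
        print(f"{label:16s} median {statistics.median(ms):5.0f} ms, "
              f"max {max(ms):5.0f} ms")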

The RAM set comprises about 7,800 individual PHP files, totaling 650 MB.

The machine has 2 GB of RAM, and usually 1 GB is free and available, even after the 650 MB used for tmpfs. Swap is not being used.

What could be the reason? Can retrieving from RAM really take more time than retrieving from disk? Or is it something related to the high number of tmpfs files?

Comments

  • donli Member
    edited April 2018

    Swap is not being used.

    Is swap enabled or disabled on the machine? Are you sure there's no swapping/paging?
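
    Something like this would confirm it either way (a minimal sketch; assumes psutil is installed):

        # Prints swap and RAM usage; a swap total of 0 means no swap device at all.
        import psutil

        swap = psutil.swap_memory()
        ram = psutil.virtual_memory()

        print(f"swap: total {swap.total // 2**20} MiB, used {swap.used // 2**20} MiB")
        print(f"ram:  total {ram.total // 2**20} MiB, available {ram.available // 2**20} MiB")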

  • No. It's not swapping. The RAM pages take up about 650 MB of RAM, the rest of the system takes up about 400 MB, and the remaining 950 MB is used by the system for caching etc., so it's effectively available.

  • Are you downloading your content from the machine itself?

    Otherwise, it may be a network issue.

  • exception0x876 Member, Host Rep, LIR

    Is swap disabled completely or just not being used? In the latter case try to disable it completely.

  • No. From another server in another city (1,000 miles apart).

    Initially, I considered the network possibility. But that should impact the disk/DB pages as well, right? Yet they're always consistent. Could it be that, because RAM page requests are responded to almost immediately, some other kind of network latency/behavior is at play? Frankly, I tried to recreate the problem from another machine on the same LAN, but wasn't able to.

  • elos42 Member
    edited April 2018

    @exception0x876 said:
    Is swap disabled completely or just not being used? In the latter case try to disable it completely.

    The machine doesn't have any swap (htop shows 0K). I presume I don't need to disable it further.

  • donli Member

    Is this a VPS (if so what type of virtualization is being used)?

  • elos42 Member
    edited April 2018

    It's a T2 EC2 instance from AWS.

  • yomero Member
    edited April 2018

    elos42 said: Frankly, I tried to recreate the problem from another machine on the same LAN, but wasn't able to.

    So, if I understand correctly, when trying from the same LAN you get consistent latency? Then it's a network problem, I guess?

  • @yomero said:
    So, when trying from the same LAN you get consistent latency? Then it's a network problem, I guess?

    It's probably the way the instance is dealing with the network, in the sense that only the RAM pages are affected while the others are not.

    The RAM pages are served by the front-end server, while the other pages are proxied to a backend FCGI server (also an EC2 instance).

  • If it were purely a network issue, there's no reason why only the pages served directly by the frontend proxy would be affected while the proxied pages work smoothly.

  • donli Member

    Has the frontend proxy been running on the same physical machine this whole time?

    Did you try moving the frontend proxy to a different physical machine (by stopping it then starting it)?

  • jsg Member, Resident Benchmarker
    edited April 2018

    @elos42 said:
    The RAM pages are served by the front-end server, while the other pages are proxied to a backend FCGI server (also an EC2 instance).

    I think you just unwittingly described the cause yourself. tmpfs doesn't get cached by the OS, plus the front server seems to be overwhelmed. Keep in mind that the FCGI server delivers nicely pre-digested content packets that the front server just needs to push out. In your first case, though, that front server has to additionally do all the things that the FCGI backends do in the other cases. Also, the FCGI backend has multiple cache levels, like the OS and its own.

    It might be worth a test to try two things: how does it work out when you use disk-based content instead of tmpfs? And how does it change things if you simply tar the whole content in scenario one?
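
    A rough sketch of the first test, timing raw reads of the same content from a tmpfs path versus a disk path (both paths are placeholders):

        # Walks each directory and measures how long it takes to read every file.
        # Note: after a first pass, the disk copy will mostly come from the page
        # cache, so a cold-cache comparison needs a fresh boot or dropped caches.
        import os
        import time

        def read_all(directory):
            start = time.perf_counter()
            total_bytes = 0
            for root, _dirs, files in os.walk(directory):
                for name in files:
                    with open(os.path.join(root, name), "rb") as f:
                        total_bytes += len(f.read())
            return total_bytes, time.perf_counter() - start

        for label, path in [("tmpfs", "/mnt/ramdisk"), ("disk", "/var/www/content")]:
            size, secs = read_all(path)
            print(f"{label}: {size / 2**20:.0f} MiB read in {secs:.2f} s")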

  • @jsg said:
    I think you just unwittingly described the cause yourself. tmpfs doesn't get cached by the OS, plus the front server seems to be overwhelmed.

    Could it really be getting overwhelmed? There are only around 100 active connections at any time. CPU usage is in the low single digits, and RAM usage is only 50% (the other 50% is used by the OS for page cache and the like).

    Also, tmpfs doesn't need to be cached, right? It's already in RAM, so it's as good as cached? Or do OS-level caches work faster than tmpfs in terms of retrieval?

    @jsg said:
    In your first case, though, that front server has to additionally do all the things that the FCGI backends do in the other cases.

    I was trying to make it easy for the frontend server by putting the content in tmpfs. I thought it just had to locate the files and push them out, given that they are pre-gzipped PHP files.

    @jsg said:
    It might be worth a test to try two things: how does it work out when you use disk-based content instead of tmpfs? And how does it change things if you simply tar the whole content in scenario one?

    I have limitations in serving it off the disk, because then I'll have to go for the costly io1 type of disk (the site gets as much as a million PVs a month, plus the CSS and JS files; images are CDNed).

    On tarring the whole content: it's already gzipped, and nginx has been instructed to send the pre-zipped versions instead of zipping them again. And it works, too. Isn't that enough?

  • Do you think moving the tmpfs files to the FCGI machine's RAM and serving it out of there using an HTTP proxy on the frontend machine will help?

  • @donli said:
    Has the frontend proxy been running on the same physical machine this whole time?

    Yes

    Did you try moving the frontend proxy to a different physical machine (by stopping it then starting it)?

    I can try that. Just stopping and restarting will change the underlying RAM/CPU?

  • donli Member
    edited April 2018

    @elos42 said:

    I can try that. Just stopping and restarting will change the underlying RAM/CPU?

    "In most cases"

    https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/Stop_Start.html#instance_stop

    When you stop a running instance, the following happens:

    In most cases, the instance is migrated to a new underlying host computer when it's started.

    I'm guessing the longer you wait between stop and start the more likely it is to be migrated.
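
    If you want to script the stop/start, something along these lines should do it (a boto3 sketch; the instance ID and region are placeholders, and note the public IP usually changes unless you use an Elastic IP):

        # Stop and start (not reboot) so EC2 is likely to move the instance
        # to a different underlying host.
        import boto3

        INSTANCE_ID = "i-0123456789abcdef0"
        ec2 = boto3.client("ec2", region_name="ap-south-1")

        ec2.stop_instances(InstanceIds=[INSTANCE_ID])
        ec2.get_waiter("instance_stopped").wait(InstanceIds=[INSTANCE_ID])

        ec2.start_instances(InstanceIds=[INSTANCE_ID])
        ec2.get_waiter("instance_running").wait(InstanceIds=[INSTANCE_ID])
        print("instance is running again, most likely on a new host")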

  • Btw, this is a phenomenon that only affects my AWS deployments. When I deploy the same thing on Linode or DO or Softsys, this doesn't happen.

  • donli Member
    edited April 2018

    @elos42 said:
    Btw, this is a phenomenon that only affects my AWS deployments. When I deploy the same thing on Linode or DO or Softsys, this doesn't happen.

    It's only on a single deployment, right? I'm just wondering if that node has a "noisy neighbor".

  • elos42 Member
    edited April 2018

    I actually did change the host once. The front-end server was initially deployed on a 1 GB node; when it ran out of RAM, we took a snapshot and created a new 2 GB instance.

    The problem of unexplained delays was present in both cases. Right now, my best guess is that when the server responds very quickly (almost instantly) with the content, somehow something is triggered that causes a delay or congestion. Interestingly, it is visible only when I test from the remote location, so the congestion could be related to the remote end as well. I'll spin up a server somewhere else and test again from another remote location.

  • Hxxx Member

    If the problem is only with AWS, why not just skip that provider and use whatever works flawlessly for you, like you said: Linode, DO, Softsys, etc.?

  • jsg Member, Resident Benchmarker

    @elos42 said:
    Also, tmpfs doesn't need to be cached, right? It's already in RAM, so it's as good as cached? Or do OS-level caches work faster than tmpfs in terms of retrieval?

    I have limitations in serving it off the disk, because then I'll have to go for the costly io1 type of disk. [...] On tarring the whole content: it's already gzipped, and nginx has been instructed to send the pre-zipped versions instead of zipping them again. And it works, too. Isn't that enough?

    Yes, tmpfs means it's already in memory, but it also means it's behind a file system and some OS layers, unlike, say, a MySQL cache, which is under the application's control.

    The tests I suggested were aimed at getting a clearer picture. As you don't experience the problem with other hosts, it's highly likely due to something AWS-specific, most probably either in their virtualization or in the kernel used. My suggestions were meant as the easy way, but you can also use strace directly. I don't know whether Amazon's VMs support that, though. If they do, you can filter for the relevant calls and quickly find where the problem is.
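
    If strace does work there, one rough way to do the filtering is to attach to an nginx worker and flag slow syscalls (a sketch only; the pgrep pattern and the 0.1 s threshold are arbitrary, and it needs root/ptrace permission):

        # Attaches strace to the first nginx worker found and prints any
        # syscall that took longer than 0.1 s (-T appends the per-call time).
        import re
        import subprocess

        pid = subprocess.check_output(["pgrep", "-f", "nginx: worker"]).split()[0].decode()

        proc = subprocess.Popen(
            ["strace", "-T", "-f", "-p", pid],
            stderr=subprocess.PIPE, text=True,
        )

        for line in proc.stderr:
            match = re.search(r"<(\d+\.\d+)>$", line.rstrip())
            if match and float(match.group(1)) > 0.1:
                print(line.rstrip())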

  • strace is possible. But I figured out that the problem was with the machine that was making the calls, not at Amazon's end. Phew!

  • @Hxxx said:
    If the problem is only with AWS, why not just skip that provider and use whatever works flawlessly for you, like you said: Linode, DO, Softsys, etc.?

    I do. I just wanted to try AWS to be closer to the audience, and their E5-2676 CPUs are far more reliable and about 30% more powerful than what DO offers in India.

  • donli Member

    @elos42 said:
    strace is possible. But I figured out that the problem was with the machine that was making the calls, not at Amazon's end. Phew!

    What was the problem?

  • eva2000 Veteran
    edited April 2018

    @elos42 said:
    It's a T2 EC2 instance from AWS.

    That's probably why. Are you using standard T2 or T2 Unlimited? AWS EC2 T2 instances are burstable, so performance can vary and is not consistent: https://aws.amazon.com/ec2/instance-types/#burst. Try a non-T2 EC2 instance type instead.

    T2 instances’ baseline performance and ability to burst are governed by CPU Credits. Each T2 instance receives CPU Credits continuously, the rate of which depends on the instance size. T2 instances accrue CPU Credits when they are idle, and use CPU credits when they are active. A CPU Credit provides the performance of a full CPU core for one minute.

    For example, a t2.small instance receives credits continuously at a rate of 12 CPU Credits per hour. This capability provides baseline performance equivalent to 20% of a CPU core (20% x 60 mins = 12 mins). If the instance does not use the credits it receives, they are stored in its CPU Credit balance up to a maximum of 288 CPU Credits. When the t2.small instance needs to burst to more than 20% of a core, it draws from its CPU Credit balance to handle this surge automatically.

    The T2 baseline is 1/5 of one CPU core for t2.small. For more info see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/t2-credits-baseline-concepts.html. A rough worked example of the credit math is sketched after the list below.

    • t2.nano baseline = 5% of 1 CPU core
    • t2.micro baseline = 10% of 1 CPU core
    • t2.small baseline = 20% of 1 CPU core
    • t2.medium baseline = 40% of 2 CPU cores
    • t2.large baseline = 60% of 2 CPU cores
    • t2.xlarge baseline = 90% of 4 CPU cores
    • t2.2xlarge baseline = 135% of 8 CPU cores
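
    A rough worked example of the credit math for a t2.small (the 40% sustained load is just a hypothetical figure):

        # t2.small: 12 credits earned per hour, 288 credit cap, 20% baseline.
        # One CPU credit = one full core for one minute.
        EARN_PER_HOUR = 12
        MAX_BALANCE = 288
        BASELINE = 0.20

        sustained_load = 0.40                 # hypothetical average CPU use
        burn_per_hour = sustained_load * 60   # core-minutes used per hour
        net_per_hour = EARN_PER_HOUR - burn_per_hour

        print(f"net credits per hour: {net_per_hour:+.0f}")
        if net_per_hour < 0:
            hours_to_empty = MAX_BALANCE / -net_per_hour
            print(f"a full balance drains in about {hours_to_empty:.0f} hours, "
                  f"then the CPU is throttled to the {BASELINE:.0%} baseline")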