{"id":3611,"date":"2018-06-03T01:25:34","date_gmt":"2018-06-03T01:25:34","guid":{"rendered":"http:\/\/dalelane.co.uk\/blog\/?p=3611"},"modified":"2019-04-11T14:44:43","modified_gmt":"2019-04-11T14:44:43","slug":"running-a-multi-region-cloud-foundry-application-in-ibm-cloud","status":"publish","type":"post","link":"https:\/\/dalelane.co.uk\/blog\/?p=3611","title":{"rendered":"Running a multi-region Cloud Foundry application in IBM Cloud"},"content":{"rendered":"<p><strong>A few technical details on how I&#8217;m implementing global load balancing to improve the availability of <a href=\"https:\/\/machinelearningforkids.co.uk\/\">Machine Learning for Kids<\/a>.<\/strong><\/p>\n<p>This wasn&#8217;t a great week for <a href=\"https:\/\/machinelearningforkids.co.uk\/\">Machine Learning for Kids<\/a>. I think the site was unavailable for a couple of days in total this week, spread across a few outages &#8211; <a href=\"http:\/\/dalelane.co.uk\/blog\/?p=3606\">the worst one lasting over twelve hours<\/a>. I know I&#8217;ve lost some users as a result &#8211; a few teachers \/ coding group leaders did email me to say (not at all unreasonably) that they can&#8217;t use a tool that they can&#8217;t rely on. <\/p>\n<p>I <a href=\"http:\/\/dalelane.co.uk\/blog\/?p=3606\">wrote in my last post<\/a> that I would be making changes to prevent this sort of thing from happening again. Now that I&#8217;ve done it, I thought it\u2019d be good to share a few details on how I did it. <\/p>\n<p><!--more--><em>(I should say that I owe a thanks to a ton of people at work who helped me with this. 
Times like this &#8211; when you need to do something new in a hurry &#8211; make me appreciate being able to get help from people across IBM in a hurry by asking a few questions in Slack!)<\/em> <\/p>\n<p><strong>To start with, a recap of how the site was running before this weekend:<\/strong><\/p>\n<p><a href=\"https:\/\/www.flickr.com\/photos\/dalelane\/41621512885\/in\/datetaken-public\/\" title=\"machinelearningforkids-after\"><img loading=\"lazy\" decoding=\"async\" style=\"border: thin black solid\" src=\"https:\/\/farm2.staticflickr.com\/1746\/41621512265_046f8a1407.jpg\" width=\"450\" height=\"172\" alt=\"machinelearningforkids-before\"\/><\/a><\/p>\n<p>The site is <a href=\"http:\/\/dalelane.co.uk\/blog\/?p=3559\">implemented as a Node.js application<\/a>. It was deployed into Cloud Foundry running in <a href=\"https:\/\/www.ibm.com\/cloud-computing\/bluemix\/data-centers\">the US South (Dallas) region of IBM Cloud<\/a>. <\/p>\n<p>By default, there were three instances of the app running.<br \/>\nThis improved reliability (if anything bad happened to cause one instance to crash, there would be two other instances available to keep handling requests without downtime).<br \/>\nIt improved performance (requests were round-robined across the three instances, so they worked together to handle all the requests the site got).<br \/>\nIt also made sure that I kept the implementation stateless, as I built everything assuming that consecutive requests might not go to the same instance.  <\/p>\n<p>I used <a href=\"https:\/\/www.ibm.com\/cloud\/auto-scaling\">an Auto Scaling service<\/a> to add extra instances if things got busy. If the memory usage stayed above 80% for longer than a blip, it would start up an extra instance of the Node.js app to help share the load. And it was configured to keep adding instances if the site got even busier still. 
<\/p>\n<p>And when things quietened down, and the memory usage dropped and stayed below 30%, the service would kill the extra instances, scaling it down as low as three. <\/p>\n<p>That&#8217;s how I&#8217;d been running the site until now. It meant that the site handled spikes and busy periods (most recently <a href=\"http:\/\/dalelane.co.uk\/blog\/?p=3581\">Scratch Day<\/a>, when the app was being used by kids all around the world). <\/p>\n<p>But&#8230; it meant that when things went wrong in the Dallas region, my site dropped off the Internet. <\/p>\n<p>It was time for the site to grow up a little. <\/p>\n<p><strong>Now the deployment looks like this:<\/strong><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" style=\"border: thin black solid\" src=\"https:\/\/farm2.staticflickr.com\/1748\/41621512885_79d9437fce.jpg\" width=\"450\" height=\"282\" alt=\"machinelearningforkids-after\"\/><\/p>\n<p>It&#8217;s still a multi-instance Node.js application, deployed to Cloud Foundry in IBM Cloud. <\/p>\n<p>But now it&#8217;s deployed to two regions &#8211; the US South (Dallas) region and the EU GB (London) region. <\/p>\n<p>And I&#8217;ve got a load balancer in front, directing the requests. <\/p>\n<p>The US South instance is still going to be the primary instance and, if all is well, serve all of the API requests. When things get busy, there is still the Auto Scaling service to spin up additional instances of the Node.js app in the US South cloud. <\/p>\n<p>But when things go wrong, if the US South instances can&#8217;t be contacted, then the load balancer will start sending requests to the EU GB instance instead. That is only running a single instance normally, as when things are running smoothly, it\u2019ll be idle. But the Auto Scaling service is running there, too &#8211; so it can quickly increase the number of instances to match the needs of the current workload. 
<\/p>\n<p><em>That first layer, in front of the two IBM Cloud regions, is also providing a cache for all the static resources. This means the Node.js applications won&#8217;t have to do much work to serve the HTML, CSS, JavaScript and images that make up the training tool and Scratch. I&#8217;m not sure I really needed that, as the site generally runs with pretty low CPU and memory even when busy, but it was trivial to add, so I added it anyway.<\/em> <\/p>\n<p>That&#8217;s the big picture summary&#8230; so a few quick pointers for how I set it up. <\/p>\n<p>I&#8217;m using <a href=\"https:\/\/www.ibm.com\/cloud\/cloud-internet-services\">IBM&#8217;s Cloud Internet Services<\/a>. The first step was creating one of those services from the <a href=\"https:\/\/console.bluemix.net\/catalog\/services\/internet-services\">IBM Cloud Catalog<\/a>. That gave me a couple of addresses of nameservers I could use. <\/p>\n<p>I <a href=\"http:\/\/dalelane.co.uk\/blog\/?p=3559#more-3559\">bought the machinelearningforkids.co.uk domain from eukhost<\/a>, so I needed to go to my admin page in eukhost and update the nameservers with those two new cloud.ibm.com addresses. <\/p>\n<p><img decoding=\"async\" style=\"border: thin black solid\" src=\"http:\/\/dalelane.co.uk\/blog\/post-images\/180603-mlforkids\/180603-eukhost-ns.png\"\/><\/p>\n<p>I deployed the Node.js applications to the two regions &#8211; to the Dallas and London regions. 
<\/p>\n<p>I gave the US South deployment the following routes:<br \/>\n<code>machinelearningforkids.co.uk<br \/>\nwww.machinelearningforkids.co.uk<br \/>\nmachinelearningforkids-ussouth.mybluemix.net<\/code><\/p>\n<p><img decoding=\"async\" style=\"border: thin black solid\" src=\"http:\/\/dalelane.co.uk\/blog\/post-images\/180603-mlforkids\/180603-ussouth-routes.png\"\/><\/p>\n<p>I gave the EU GB deployment the following routes:<br \/>\n<code>machinelearningforkids.co.uk<br \/>\nwww.machinelearningforkids.co.uk<br \/>\nmachinelearningforkids-eugb.eu-gb.mybluemix.net<\/code> <\/p>\n<p><img decoding=\"async\" style=\"border: thin black solid\" src=\"http:\/\/dalelane.co.uk\/blog\/post-images\/180603-mlforkids\/180603-eugb-routes.png\"\/><\/p>\n<p>Both deployments have the normal\/externally-visible site routes, and they both have a region-specific &#8220;internal&#8221; route. <\/p>\n<p>Note that this means I needed to add the custom domain and its SSL cert to both regions. <\/p>\n<p><img decoding=\"async\" style=\"border: thin black solid\" src=\"http:\/\/dalelane.co.uk\/blog\/post-images\/180603-mlforkids\/180603-certs.png\"\/><\/p>\n<p>I also added the certificate to the Cloud Internet Services instance, and set the TLS setting to \u201cEnd to end (flexible)&#8221;.<\/p>\n<p><img decoding=\"async\" style=\"border: thin black solid\" src=\"http:\/\/dalelane.co.uk\/blog\/post-images\/180603-mlforkids\/180603-cis-certs.png\"\/><\/p>\n<p>The setup process in Cloud Internet Services looks like this.<\/p>\n<p>I defined a &#8220;Health Check&#8221; &#8211; based on a GET request to an API endpoint that doesn&#8217;t do anything other than check the application is responsive. <\/p>\n<p><img decoding=\"async\" style=\"border: thin black solid\" src=\"http:\/\/dalelane.co.uk\/blog\/post-images\/180603-mlforkids\/180603-healthcheck.png\"\/><\/p>\n<p>I created a couple of &#8220;Origin Pools&#8221; &#8211; one for the US, and one for Europe. 
<\/p>\n<p><img decoding=\"async\" style=\"border: thin black solid\" src=\"http:\/\/dalelane.co.uk\/blog\/post-images\/180603-mlforkids\/180603-originpools.png\"\/><\/p>\n<p>The US origin pool points at the region-specific route for the US application deployment:<br \/>\n<code>machinelearningforkids-ussouth.mybluemix.net<\/code><\/p>\n<p><img decoding=\"async\" style=\"border: thin black solid\" src=\"http:\/\/dalelane.co.uk\/blog\/post-images\/180603-mlforkids\/180603-origin-us.png\"\/><\/p>\n<p>The Europe origin pool points at the region-specific route for the UK application deployment:<br \/>\n<code>machinelearningforkids-eugb.eu-gb.mybluemix.net<\/code><\/p>\n<p><img decoding=\"async\" style=\"border: thin black solid\" src=\"http:\/\/dalelane.co.uk\/blog\/post-images\/180603-mlforkids\/180603-origin-gb.png\"\/><\/p>\n<p>I defined the &#8220;Load Balancer&#8221;, pointing to both of these origin pools. The order is defined to give priority to the US origin pool.<br \/>\nAnd the hostname is set to &#8220;<code>www<\/code>&#8221;. <\/p>\n<p><img decoding=\"async\" style=\"border: thin black solid\" src=\"http:\/\/dalelane.co.uk\/blog\/post-images\/180603-mlforkids\/180603-loadbalancer.png\"\/><\/p>\n<p>Finally, I created a single DNS record. <\/p>\n<p>A CNAME record with the name &#8220;<code>@<\/code>&#8221; and the value &#8220;<code>www.machinelearningforkids.co.uk<\/code>&#8221;. This points all records at the <code>www<\/code> load balancer.  <\/p>\n<p><img decoding=\"async\" style=\"border: thin black solid\" src=\"http:\/\/dalelane.co.uk\/blog\/post-images\/180603-mlforkids\/180603-dns.png\"\/><\/p>\n<p>As I said, I also enabled caching. It doesn&#8217;t really help with reliability, but was simple enough to turn on. <\/p>\n<p>I just added a couple of Page Rules, identifying URLs where resources are safe to cache. 
I\u2019ve long had all the static resources served at addresses starting <code>https:\/\/machinelearningforkids.co.uk\/static\/*<\/code> and every build adds a unique build timestamp in the resource paths, so it&#8217;s safe to cache everything under here forever. <\/p>\n<p><img decoding=\"async\" style=\"border: thin black solid\" src=\"http:\/\/dalelane.co.uk\/blog\/post-images\/180603-mlforkids\/180603-cis-cache.png\"\/><\/p>\n<p>That&#8217;s pretty much everything. <\/p>\n<p>To sum it all up:<\/p>\n<p><a href=\"https:\/\/www.flickr.com\/photos\/dalelane\/41802229324\/in\/datetaken-public\/\" title=\"machinelearningforkids-after-notes\"><img loading=\"lazy\" decoding=\"async\" style=\"border: thin black solid\" src=\"https:\/\/farm2.staticflickr.com\/1727\/41802229324_d162b65750.jpg\" width=\"450\" height=\"263\" alt=\"machinelearningforkids-after-notes\"\/><\/a><\/p>\n<p><strong>(1)<\/strong><br \/>\nThe user&#8217;s web browser goes to <a href=\"https:\/\/machinelearningforkids.co.uk\">https:\/\/machinelearningforkids.co.uk<\/a><br \/>\nThe nameservers for this address are now delegated to IBM CIS. <\/p>\n<p><strong>(2)<\/strong><br \/>\nAny requests for static resources (CSS, images, JavaScript, HTML components) are returned from the CIS cache directly<br \/>\nAPI requests need to go to a Node.js application. 
The DNS routing directs the request to the <code>www<\/code> load balancer<\/p>\n<p><strong>(3)<\/strong><br \/>\nThe <code>www<\/code> load balancer uses the health check I defined to poll the two origin pools regularly.<br \/>\nIf it knows that the US South deployment is healthy and responsive, it directs the request to <code>machinelearningforkids-ussouth.mybluemix.net<\/code><\/p>\n<p><strong>(4)<\/strong><br \/>\nThe US South cloud is registered to handle requests to <code>machinelearningforkids-ussouth.mybluemix.net<\/code> because that route is in the application&#8217;s deployment manifest.yml file<\/p>\n<p><strong>(5)<\/strong><br \/>\nCloud Foundry directs the request to one of the Node.js app instances<\/p>\n<p><strong>(6)<\/strong><br \/>\nThe Auto Scaling service monitors the memory usage of all of the Node.js app instances, adding or removing instances to keep the memory within thresholds that I\u2019ve set.<\/p>\n<p>This means the site is still going to be able to handle spikes and busy periods. And it should now remain accessible in the event of regional outages. <\/p>\n","protected":false},"excerpt":{"rendered":"<p>A few technical details on how I&#8217;m implementing global load balancing to improve the availability of Machine Learning for Kids. This wasn&#8217;t a great week for Machine Learning for Kids. 
I think the site was unavailable for a couple of days in total this week, spread across a few outages &#8211; the worst one lasting [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[3],"tags":[587],"class_list":["post-3611","post","type-post","status-publish","format-standard","hentry","category-tech","tag-mlforkids-tech"],"_links":{"self":[{"href":"https:\/\/dalelane.co.uk\/blog\/index.php?rest_route=\/wp\/v2\/posts\/3611","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/dalelane.co.uk\/blog\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/dalelane.co.uk\/blog\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/dalelane.co.uk\/blog\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/dalelane.co.uk\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=3611"}],"version-history":[{"count":0,"href":"https:\/\/dalelane.co.uk\/blog\/index.php?rest_route=\/wp\/v2\/posts\/3611\/revisions"}],"wp:attachment":[{"href":"https:\/\/dalelane.co.uk\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=3611"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/dalelane.co.uk\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=3611"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/dalelane.co.uk\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=3611"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}