Friday, July 22, 2011

Clouds: EC2 and GAE in the Indie Dev world.

"Cloud computing" is a big bleeding-edge phrase these days.

In one sense, all it really means is that there's a processing and storage service out there on which you can build and run software. You don't have to buy or maintain the machines, and for the most part you don't have to install any of the base software (operating systems, core services like data and caching, etc.). It makes app dev a lot easier; you register an account with your cloud provider, they allocate you a CPU and storage budget, and they provide a way for you to upload your software to run. This is, of course, primarily for Internet-based apps, anything from a website to a full-on financial RIA.

In another sense, it also means that you now work under the restrictions of that cloud provider. If they don't want you running certain classes in their cloud infrastructure, they blacklist them, and you have to find an alternative. Such restrictions usually involve IO operations like file writes, security and encryption, and other general black-hat sort of stuff.

The trick is to find the balance: a provider that gives you the flexibility you want without making it just as onerous as actually owning and maintaining the machines (things like auto backups and data replication, etc.). Based on what you need, you go towards either end of that balance.

I've been working with cloud computing as an app developer for some time now. From a developer's perspective, it's all pretty much the same set of questions: what technology, what APIs/frameworks, where do I put it, and what basic server admin do I have to perform? So while I've heard a lot of people talk about "the cloud" as a mysterious and nebulous thing, I always just thought of it as another server out there to run my cruft.

It's evolving though; it's becoming RIDICULOUSLY easy to put a high end app out there, relative to the difficulty of even a few years ago.

The two clouds I use these days:

- EC2 (Amazon). Coupled with their S3 (simple storage) service, it provides a way to actually spin up remote machines of almost any configuration. For example, Amazon has a small instance type called a "micro instance" that you can spin up with a Linux base and access via its IP address just like any other remote server. You log in as ec2-user with your public/private key, switch to root (you cannot log in as root directly), and SSH all day. Usually you have to employ yum a lot to install Apache, FTP, and what have you, and do a lot of RPM gets to fill in some blanks, but they do a good job of making pretty much everything for a basic server config available.
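As a rough illustration of the key-based login described above, here is a minimal Python sketch that builds the SSH command line. The IP address and the key file name are hypothetical placeholders, and `ssh_command` is a helper made up for this example; only the `ssh -i <keyfile> user@host` form itself is standard OpenSSH usage.

```python
import shlex


def ssh_command(host, key_path, user="ec2-user"):
    """Build the argv for an SSH login that authenticates with a private key file."""
    # -i selects the identity (private key) file, e.g. the .pem generated for the instance.
    return ["ssh", "-i", key_path, "{0}@{1}".format(user, host)]


# Hypothetical address and key file, for illustration only.
argv = ssh_command("203.0.113.10", "my-key.pem")

# Print the command in a form that could be pasted into a shell.
print(" ".join(shlex.quote(part) for part in argv))
```

Once logged in as ec2-user, switching to root is typically done with sudo rather than with a separate root login.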

Admin is simple via the EC2/S3 web interfaces. You can spin up buckets for storage, start/stop instances (which starts/stops billing; you only pay for used CPU and storage space, so if your instances are all shut down, all you pay for is storage), make and edit security groups, generate .pems for your keys, and so on. And it's CHEAP. As a dev, I spin up a micro instance, install Apache, Tomcat, whatever, check the Python and Perl installs, and am pretty much ready to stage anything. If I shut down the instance when I'm not using it (so say I leave it running 8 hours a day), my bill at the end of the month is literally a few bucks.
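To make the pay-for-what-you-run idea concrete, here is a small back-of-the-envelope sketch in Python. The hourly and per-GB rates below are assumed placeholder numbers, not quoted Amazon prices, and `monthly_cost` is a helper invented for this example; plug in whatever your provider actually charges.

```python
def monthly_cost(hourly_rate, hours_per_day, days, storage_gb, storage_rate_per_gb):
    """Estimate a month's bill: instance-hours actually run, plus storage held all month."""
    compute = hourly_rate * hours_per_day * days
    storage = storage_gb * storage_rate_per_gb
    return compute + storage


# Assumed placeholder rates, NOT real price-list values.
HOURLY_RATE = 0.02    # dollars per instance-hour
STORAGE_RATE = 0.10   # dollars per GB per month

part_time = monthly_cost(HOURLY_RATE, 8, 30, 10, STORAGE_RATE)   # running 8 hours a day
always_on = monthly_cost(HOURLY_RATE, 24, 30, 10, STORAGE_RATE)  # never shut down
stopped = monthly_cost(HOURLY_RATE, 0, 30, 10, STORAGE_RATE)     # storage only

print("8 h/day: $%.2f, 24 h/day: $%.2f, stopped: $%.2f" % (part_time, always_on, stopped))
```

With these made-up rates, the 8-hours-a-day case lands at a few dollars a month, and a fully stopped instance costs only its storage.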

Scaling is up to you. If your micro instance gets slammed, it'll bork. There are services out there that can handle scaling for you; if you're trying to get out of the hardware end of things by getting into cloud computing, a service like this is essential, or you're just going to end up hiring your network and server guys all over again. You can of course also manage it yourself; the EC2 admin tools provide APIs and such to get it done.

I like EC2 a lot. I actually canceled all my traditional server hosting, which was costing me in excess of $200 a month just to be able to stage apps for clients, and now it costs me maybe $10 or so. It's also very flexible. To me there's basically no difference; it's easy to think you actually have a real server out there, when all you actually have is a chunk of CPU clay pulled off the big CPU clay ball. You configure that chunk however you want, be it Windows, Linux, Unix, web server, app server, whatever. There are some security and storage restrictions, but nothing I've ever run into in day-to-day dev.

- Google App Engine (GAE)

I've been writing code as a subcontractor for Google projects these days (YouTube Town Hall, that sort of thing), and I use the GAE for my backends.

GAE goes the other way from EC2. Where EC2 is heavy on flexibility at the cost of some added complexity (like needing public/private keys for logins, configuring security groups, having to install things on your instances, and monitoring the instances carefully), GAE provides a prescribed way of developing your apps. You can use either Python or Java (unlike EC2, on which you can use pretty much any dev technology I've ever seen). Access for devs is tied to a Gmail account. You just sign up for the GAE, get an account/app key, and can deploy 10 apps, utilizing a decent amount of CPU and storage, for free (for now anyway).

Regarding "Java or Python for the GAE", I've used both; in general, I lean towards Python unless Python makes it difficult, and it often does. I've had very unstable results with PyAMF on the GAE, but I know I can get AMF running in Java on the GAE, so when I need AMF, I use Java. Also, the Objectify data framework for GAE (Java based) is very powerful and well written; it makes using the GAE datastore pretty straightforward.

You don't get the tactile sensation of a "solid server" that EC2 gives you. You deploy your code as an upload (like FTPing), and the GAE compiles, validates, and deploys it. You then access the functioning app online at [yourapp].appspot.com. Servlets, services, and all that, provided they made the blacklist cut, will all run as expected. You have an admin interface that tells you how much CPU and storage your app is using, and you can see things like what's in the task queue, inspect your data store, see usage stats, and whatnot. The data service is in the form of an object store (not a relational DB), and there's a memcache service, a task queue (configurable queues too), cron capability, a bulk data uploader/downloader (which is an admin tool, not an app tool), and some messaging capability. That's everything you need to build out a fairly heavy-duty app.
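One common way to combine an object store with a memcache-style service is a read-through cache: check the cache first, fall back to the store on a miss, then remember the result. The Python sketch below shows only the pattern. It uses a plain dictionary and a made-up `FakeStore` class as stand-ins so it runs anywhere; it is not GAE API code, and all the names in it are hypothetical.

```python
class FakeStore:
    """In-memory stand-in for a slow object store; counts reads so the effect is visible."""

    def __init__(self, records):
        self._records = dict(records)
        self.reads = 0

    def get(self, key):
        self.reads += 1
        return self._records.get(key)


def cached_get(key, cache, store):
    """Read-through lookup: try the cache, fall back to the store, then populate the cache."""
    value = cache.get(key)
    if value is None:
        value = store.get(key)
        if value is not None:
            # Only cache real hits, so a missing key is looked up again next time.
            cache[key] = value
    return value


store = FakeStore({"user:1": {"name": "Ada"}})
cache = {}  # stand-in for a memcache client

first = cached_get("user:1", cache, store)   # miss: goes to the store
second = cached_get("user:1", cache, store)  # hit: served from the cache
print(first == second, store.reads)
```

In a real app the dictionary would be replaced by the platform's cache client and the fake store by the datastore layer, with an expiry on cached entries so stale data eventually drops out.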

However, those services are what you have to use if you want to do your processing with GAE resources. Unless there are arrangements different than the default one I have, what you see is what you get. Things like file read/writes and such are also much more heavily restricted in the GAE, putting some solid walls around your app and what it can and can't do. If you're inventive, you can generally use the tools they provide to meet the need.

Scaling is braindead easy in the GAE; you don't really think about it. You just up your budget for CPU/storage, and you get more CPU time. If your app hits a threshold, the GAE also automatically spins up more "instances" so that your app can load balance (sort of). You don't worry about it at all; as long as you have the budget, you're good. Even under very heavy load (like hundreds of thousands of hits) I've not seen this budget exceed 60 bucks in a day, and that's VERY unusual. Typically it's a few dollars for a reasonably busy app that isn't doing huge background processing jobs (which I would offload to EC2).

Because the GAE is much more prescribed, once you get that prescribed model it's pretty EASY. You fall into repeatable patterns because there aren't multiple ways to do things like data writes; you use Objectify on top of the GAE datastore API, or use JDO and beware blacklisted classes. It's also very easy to admin; you go to your GAE interface and can inspect the data store, task queues, see how much CPU and storage your app is using, etc. There are some deeper tools too, like the ability to use custom domains instead of appspot ones, but I haven't needed anything like that yet.

Which do I like better? Hard to say, because I've combined them for a couple of apps. I use EC2 to spin up instances when, say, I need to do a lot of scheduled processing, like image or video download/processing. If I have particular requirements from a client for data storage or web serving, EC2 is also the way to go. But for deploying and serving general web applications, even ones with some pretty slick data and task requirements, the GAE, as long as Python or Java is OK with you, is a great, easy, and even fun way to get the job done.

Either way, both save you a lot of money and headaches. Naturally, the heavier your use, the more money and headaches you'll expend, but for general app dev, particularly for testing and staging, and even for long-running apps if you know what you're doing, dump your old hard boxes and use the clouds.

As always, thanks for visiting.