AWS Cloud Computing Essentials for Data Science: A Guide
Chapter 1: Understanding Cloud Computing
Imagine yourself as the organizer of data science networking events in New York. A key aspect of your role involves securing a venue to host your guests. While you've traditionally rented spaces throughout the city, you begin to ponder whether purchasing your own venue might alleviate the stress of finding availability each time.
In this analogy, the venue represents a computer, and the cloud computing sector would emphatically advise against buying one. The prevailing argument is that renting computer resources in a data center is more economical, secure, and scalable. The cloud providers would suggest, "Let us handle the venue arrangements for you. In fact, why not focus solely on hosting an outstanding event without the burden of venue management?"
The notion of shared computing resources dates back to the 1960s, with the term "cloud computing" emerging in the 1990s. However, it wasn't until the 2000s that the sector began to evolve into the massive industry we see today. In 2021, the global cloud computing market reached an impressive $445.3 billion, with projections indicating it could reach $947.3 billion by 2026, which works out to roughly 16% growth per year!
What fuels the widespread adoption of cloud computing? How can data scientists, machine learning engineers, and software developers effectively utilize it? These inquiries are complex and extensive, necessitating more than a single post to address them. This article will introduce the concept of cloud computing, explaining its significance and relevance. Subsequent posts will delve deeper into the two primary components of cloud computing: storage and compute.
Our focus will be on Amazon Web Services (AWS), the dominant player and a key force behind the cloud revolution. However, the principles discussed here also apply to other major cloud providers like Google Cloud Platform (GCP), Microsoft Azure, and Alibaba Cloud. Let’s dive in!
What Is Cloud Computing?
To simplify, the cloud consists of numerous computers housed in various data centers. A data center is essentially a secure facility designed to accommodate a large number of operational computers, often located in areas with low electricity costs or favorable climates.
These computers have no monitors or keyboards; they are purely the hardware that performs calculations, stores data, and responds to requests. We call these machines servers, because they serve user requests, to distinguish them from the personal laptops and desktops most people are accustomed to.
The data centers operated by leading tech companies—like Amazon, Meta, and Google—feature millions of servers organized in racks that stretch as far as the eye can see. Each time a user uploads a photo to Instagram or interacts on WhatsApp, they are engaging with several of these servers.
Section 1.1: The Advantages of Cloud Computing
You might be familiar with the ease of cloud storage if you've used services like iCloud, Dropbox, or Google Drive. These platforms allow you to recover lost texts, share files seamlessly, and organize your photos by content.
However, cloud computing extends beyond personal convenience; it can significantly enhance your professional endeavors as well. Not all of Amazon's servers are devoted to processing shoe searches or determining ad placements. Many of these servers are available for you to rent.
The primary advantage of cloud computing is the ability to rent servers without the need for purchasing and maintaining them. If the cloud sector had a catchphrase, it would be, "Access the resources you need, when you need them." Just like a cloud can take on different shapes, the industry allows users to customize their resources to fit their requirements.
For example, if you are launching a dating app, you'll need storage solutions for user photos and systems to train recommendation algorithms. You can easily rent a server optimized for storage to host photos, another for running calculations for ranking models, and a smaller one for handling site hosting and user authentication.
As your app grows—say, around Valentine's Day—you can scale up the number of servers and then reduce them as user engagement decreases. You only pay for what you utilize, avoiding upfront hardware costs and eliminating worries about server failures since cloud providers offer robust backups and redundancy.
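To make the pay-for-what-you-use argument concrete, here is a minimal Python sketch comparing elastic rental with keeping enough servers for peak demand running the whole time. Every number in it (the traffic figures, the users one server can handle, and the daily price) is made up for illustration, and the function names are hypothetical rather than anything AWS provides.

```python
import math

def servers_needed(users, users_per_server):
    """Smallest number of servers that can handle `users` (at least one)."""
    return max(1, math.ceil(users / users_per_server))

def elastic_cost(daily_users, users_per_server, price_per_server_day):
    """Pay each day only for the servers that day's traffic requires."""
    return sum(
        servers_needed(users, users_per_server) * price_per_server_day
        for users in daily_users
    )

def peak_provisioned_cost(daily_users, users_per_server, price_per_server_day):
    """Keep enough servers for the busiest day running every day."""
    peak = max(servers_needed(users, users_per_server) for users in daily_users)
    return peak * price_per_server_day * len(daily_users)

# Hypothetical traffic: a quiet week with a Valentine's Day spike on day 3.
daily_users = [1000, 1200, 5000, 1100]

print(elastic_cost(daily_users, users_per_server=1000, price_per_server_day=10))
# prints 100
print(peak_provisioned_cost(daily_users, users_per_server=1000, price_per_server_day=10))
# prints 200
```

The comparison is deliberately crude, since it applies the same daily price to both strategies and ignores purchase and maintenance costs, but it shows why sizing for the spike is wasteful when the spike is rare.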
Moreover, customers can reserve just a portion of a server, allowing for even more efficient resource allocation. This is made possible by virtualization, which lets one physical machine host several isolated virtual servers. It benefits both sides: clients pay only for the slice they need, and providers keep their servers productive.
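A toy way to see why sharing servers keeps them productive is to pack customer requests onto machines. The sketch below uses a simple first-fit strategy with hypothetical numbers (8 virtual CPUs per server, six made-up requests); real providers schedule virtual machines in far more sophisticated ways, so treat this purely as an illustration.

```python
def first_fit(requests, server_capacity):
    """Place each request on the first server with enough free capacity.

    Returns (placements, used), where placements[i] is the server index
    chosen for requests[i] and used[j] is the capacity consumed on server j.
    """
    used = []
    placements = []
    for size in requests:
        if size > server_capacity:
            raise ValueError(f"request of {size} exceeds one server's capacity")
        for index, consumed in enumerate(used):
            if consumed + size <= server_capacity:
                used[index] += size
                placements.append(index)
                break
        else:
            # No existing server had room, so bring a new one into service.
            used.append(size)
            placements.append(len(used) - 1)
    return placements, used

# Hypothetical vCPU requests from six customers, on 8-vCPU servers.
placements, used = first_fit([2, 4, 1, 8, 3, 2], server_capacity=8)
print(placements)  # prints [0, 0, 0, 1, 2, 2]
print(used)        # prints [7, 8, 5]
```

Six customers fit on three machines instead of six, which is the efficiency gain the paragraph above describes.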
The first video titled "AWS for Data Science Basics | AWS Cloud Computing for Beginners | AWS Tutorial" provides an overview of AWS fundamentals and how they apply to data science.
What Is the Cloud: A Formal Definition
According to the National Institute of Standards and Technology (NIST), cloud computing is defined by five key characteristics:
- On-Demand Self-Service: Customers can access the necessary compute resources without needing direct interaction with the provider.
- Broad Network Access: These resources are accessible via the internet and can be utilized across multiple platforms.
- Resource Pooling: Cloud resources are dynamically assigned and reassigned, abstracting their specific locations from the user.
- Rapid Elasticity: Resources can be quickly provisioned or released to respond to fluctuating demand.
- Measured Service: Resource usage is accurately monitored, allowing users to see real-time costs and adjust their allocations accordingly.
Returning to the venue analogy, a "Venue AWS" service would enable users to effortlessly book the appropriate venue size through an app. Users could adjust their reservations dynamically in response to unexpected changes in attendee numbers and see transparent billing for their usage.
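The "Venue AWS" idea can be sketched in a few lines of Python. The class below is a toy model, not an AWS API: the name ToyCloud, its methods, and the prices are all invented here to show how self-service resizing, a shared pool, elasticity, and metering fit together.

```python
class ToyCloud:
    """A toy shared pool of capacity with self-service resizing and metering."""

    def __init__(self, total_units, price_per_unit_hour):
        self.total_units = total_units
        self.price_per_unit_hour = price_per_unit_hour
        self.allocations = {}  # customer -> units currently reserved
        self.unit_hours = {}   # customer -> metered usage so far

    def resize(self, customer, units):
        """On-demand self-service and rapid elasticity: set a reservation."""
        others = sum(self.allocations.values()) - self.allocations.get(customer, 0)
        if others + units > self.total_units:
            raise RuntimeError("not enough capacity left in the shared pool")
        self.allocations[customer] = units

    def tick(self, hours=1):
        """Measured service: record usage for the time that has passed."""
        for customer, units in self.allocations.items():
            self.unit_hours[customer] = self.unit_hours.get(customer, 0) + units * hours

    def bill(self, customer):
        return self.unit_hours.get(customer, 0) * self.price_per_unit_hour

cloud = ToyCloud(total_units=100, price_per_unit_hour=0.5)
cloud.resize("meetup", 10)  # book a small venue
cloud.tick(hours=2)
cloud.resize("meetup", 40)  # more attendees than expected
cloud.tick(hours=1)
cloud.resize("meetup", 0)   # event over, release everything
cloud.tick(hours=5)
print(cloud.bill("meetup"))  # prints 30.0
```

The bill covers 10 units for two hours plus 40 units for one hour, and nothing for the five idle hours afterwards.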
When Is Cloud Computing Not the Best Choice?
Before proceeding, it's crucial to consider some drawbacks of cloud computing. Despite their best efforts, cloud providers can experience outages that affect their customers. For instance, in November 2020, Adobe, iRobot, and Roku faced downtime when an AWS service failed in the US-East-1 region after new server capacity was added. Additionally, systems running in the cloud are not immune to data breaches, as seen with the 2021 leak of Twitch's source code, which Twitch attributed to a server configuration error that exposed data to an unauthorized third party.
If your work involves handling extremely sensitive data or requires constant uptime—such as in emergency response scenarios—it may be prudent to invest in a system that you can fully control.
The second video titled "All Data Scientists Should Know These AWS Compute and Storage Services" explores essential AWS services that every data scientist should be aware of.
What Is AWS?
AWS, or Amazon Web Services, is Amazon's cloud computing platform. Launched in 2002, AWS was part of Amazon's initiative to establish a more service-oriented infrastructure for its software engineers. The ambitious goal was to enhance team autonomy, standardize infrastructure, and facilitate continuous code deployment. This approach necessitated a distributed and scalable software architecture.
The success of this approach led Amazon to offer its infrastructure as a public service, with now-familiar products such as S3 and EC2 arriving in 2006. AWS has since captured approximately 32% of the cloud computing market. It provides over 200 services, allowing users to either work at the operating system level or simply launch applications through services that handle most of the complexity.
Setting Up Your AWS Account
Now, let's create an AWS account. Following along with this guide will enhance your understanding and allow for hands-on experimentation later.
To begin, visit the AWS website and click on "Create an AWS Account." AWS offers a free tier for the first year, enabling you to learn and experiment without incurring costs. However, a credit card is required for verification.
The account creation process is straightforward. After completing the steps, you will access the AWS console, which showcases widgets such as recent applications visited and usage statistics.
Identity and Access Management (IAM)
Congratulations on creating your account! Should we start developing an image classifier or a chatbot? Let's begin with identity management.
This may seem like an unexpected starting point, but adhering to security best practices is vital for effectively utilizing the cloud. Your users won't be forgiving if your system gets compromised, especially if it involves sensitive data. Additionally, an attacker could rack up significant costs before you manage to secure your account.
When you established your AWS account, you created a root user, which possesses extensive privileges across all services. AWS advises new users to set up multi-factor authentication (MFA) to enhance security.
Navigate to IAM (Identity and Access Management) to manage user permissions. The first step is to add MFA, selecting a method like an authentication app, to add a layer of security to your root account.
Once MFA is configured, anyone signing in as the root user must supply a second factor in addition to the password. This protection is crucial because the root account can do anything; for day-to-day work you'll typically sign in as users whose permissions are restricted to what your workflow requires.
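If you're curious what an authentication app is doing under the hood, most of them implement time-based one-time passwords (TOTP, standardized in RFC 6238). The sketch below is a bare-bones version using only the Python standard library. It is meant to illustrate the idea rather than replace a vetted library, the function name is my own, and real apps receive the shared secret base32-encoded inside a QR code, which this sketch skips.

```python
import hashlib
import hmac
import struct
import time
from typing import Optional

def totp(secret: bytes, at_time: Optional[float] = None,
         step: int = 30, digits: int = 6) -> str:
    """Compute a time-based one-time password (HMAC-SHA1 variant)."""
    if at_time is None:
        at_time = time.time()
    # Both sides derive the same counter from the clock: one tick per `step` seconds.
    counter = int(at_time // step)
    digest = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    # Dynamic truncation: the low 4 bits of the last byte choose a 4-byte window.
    offset = digest[-1] & 0x0F
    number = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(number % 10 ** digits).zfill(digits)

# Test vector from RFC 6238, Appendix B (SHA-1, 8 digits, time = 59 seconds).
print(totp(b"12345678901234567890", at_time=59, digits=8))  # prints 94287082
```

Because the code depends on a secret that only your device and AWS share, and it changes every 30 seconds, a stolen password alone is not enough to sign in.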
Let's set up an IAM user. We'll start by creating a user group, whose permissions apply to every user in it. This simplifies onboarding, since a new user only needs to be added to an already configured group.
To create a group in IAM, click on "User groups," then "Create group." Name the group (e.g., "admins") and attach the AdministratorAccess policy.
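It helps to know what a policy like AdministratorAccess contains. IAM policies are JSON documents, and as far as I know the administrator policy boils down to a single statement allowing every action on every resource. The Python sketch below writes that document next to a narrower, made-up S3 read-only example and checks requests against them. The is_allowed function is a heavy simplification of my own: real IAM evaluation also handles explicit denies, conditions, and several policy types.

```python
from fnmatch import fnmatchcase

ADMINISTRATOR_ACCESS = {
    "Version": "2012-10-17",
    "Statement": [{"Effect": "Allow", "Action": "*", "Resource": "*"}],
}

# Hypothetical narrower policy, for contrast.
S3_READ_ONLY_EXAMPLE = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow", "Action": ["s3:GetObject", "s3:ListBucket"], "Resource": "*"}
    ],
}

def _as_list(value):
    """Policy fields may hold a single string or a list of strings."""
    return [value] if isinstance(value, str) else list(value)

def is_allowed(policy, action, resource):
    """Simplified check: does any Allow statement match the action and resource?"""
    for statement in policy["Statement"]:
        if statement["Effect"] != "Allow":
            continue
        action_ok = any(fnmatchcase(action, p) for p in _as_list(statement["Action"]))
        resource_ok = any(fnmatchcase(resource, p) for p in _as_list(statement["Resource"]))
        if action_ok and resource_ok:
            return True
    return False

bucket = "arn:aws:s3:::example-bucket/photo.jpg"  # made-up resource name
print(is_allowed(ADMINISTRATOR_ACCESS, "s3:DeleteObject", bucket))  # prints True
print(is_allowed(S3_READ_ONLY_EXAMPLE, "s3:DeleteObject", bucket))  # prints False
print(is_allowed(S3_READ_ONLY_EXAMPLE, "s3:GetObject", bucket))     # prints True
```

The contrast shows why administrator access should go only to people who need it: nothing is off limits.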
Next, create a user by navigating to "Users," clicking on "Add users," and entering a name for the account. Enable both "Programmatic access" and "AWS Management Console access," allowing you to interact with AWS through both code and the web interface.
After selecting "Next: Permissions," add your user to the "admins" group. You can skip the "Tags" section for now and confirm the creation of the user.
It's important to note that the auto-generated password and secret access key are provided only once, so be sure to save them securely.
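The same group and user setup can also be scripted. The sketch below assumes you pass in a boto3 IAM client (boto3 is the AWS SDK for Python and is not installed by default); the function name, the "admins" default, and the user name in the usage comment are placeholders of mine. It only covers programmatic access, so console passwords are left to the web interface.

```python
ADMIN_POLICY_ARN = "arn:aws:iam::aws:policy/AdministratorAccess"

def create_admin_user(iam, user_name, group_name="admins"):
    """Create a group with administrator access, add a new user, return their key.

    `iam` is expected to be a boto3 IAM client, e.g. boto3.client("iam").
    """
    iam.create_group(GroupName=group_name)
    iam.attach_group_policy(GroupName=group_name, PolicyArn=ADMIN_POLICY_ARN)
    iam.create_user(UserName=user_name)
    iam.add_user_to_group(GroupName=group_name, UserName=user_name)
    response = iam.create_access_key(UserName=user_name)
    # The secret access key appears only in this response, so store it safely.
    return response["AccessKey"]

# Usage (needs boto3 installed and credentials allowed to manage IAM):
#   import boto3
#   key = create_admin_user(boto3.client("iam"), "my-user-name")
#   print(key["AccessKeyId"])
```

Passing the client in as an argument keeps the function easy to test with a stand-in object before pointing it at a real account.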
Finally, to log in as an IAM user, you'll need your AWS account's 12-digit ID. To avoid memorizing it, create an account alias in the IAM dashboard and sign in with that instead.
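The account ID or alias also forms the first part of the sign-in address for IAM users. Here is a small Python helper that builds it; the alias shown is made up, and the validation is a rough approximation of AWS's naming rules rather than an official check.

```python
import re

def console_sign_in_url(account: str) -> str:
    """Build the IAM user sign-in URL from a 12-digit account ID or an alias."""
    if account.isdigit() and len(account) != 12:
        raise ValueError("an AWS account ID must be exactly 12 digits")
    # Approximate rule: lowercase letters, digits, and inner hyphens only.
    if not re.fullmatch(r"[a-z0-9](?:[a-z0-9-]*[a-z0-9])?", account):
        raise ValueError(f"unexpected characters in {account!r}")
    return f"https://{account}.signin.aws.amazon.com/console"

print(console_sign_in_url("123456789012"))
# prints https://123456789012.signin.aws.amazon.com/console
print(console_sign_in_url("data-science-meetup"))
# prints https://data-science-meetup.signin.aws.amazon.com/console
```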
Command Line Interface (CLI)
To enhance your experience, install the AWS CLI for command line access to your AWS services. This tool streamlines various tasks, such as uploading multiple files to an S3 bucket or automating server scaling.
After installation, run aws configure and enter your access key ID, secret access key, default region, and output format when prompted.
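Behind the scenes, aws configure stores your answers in two small INI-style files, credentials and config, inside a .aws folder in your home directory. The Python sketch below reproduces that layout in a temporary folder so it cannot overwrite a real setup. The keys are the placeholder values AWS uses in its own documentation, the region is an arbitrary choice, and the function name is hypothetical.

```python
import configparser
import tempfile
from pathlib import Path

def write_aws_profile(aws_dir, access_key_id, secret_access_key, region, output="json"):
    """Write `credentials` and `config` files in the layout the AWS CLI reads."""
    aws_dir = Path(aws_dir)
    aws_dir.mkdir(parents=True, exist_ok=True)

    credentials = configparser.ConfigParser(interpolation=None)
    credentials["default"] = {
        "aws_access_key_id": access_key_id,
        "aws_secret_access_key": secret_access_key,
    }
    with open(aws_dir / "credentials", "w") as handle:
        credentials.write(handle)

    config = configparser.ConfigParser(interpolation=None)
    config["default"] = {"region": region, "output": output}
    with open(aws_dir / "config", "w") as handle:
        config.write(handle)

# Demo in a throwaway directory instead of the real ~/.aws folder.
with tempfile.TemporaryDirectory() as tmp:
    write_aws_profile(
        Path(tmp) / ".aws",
        access_key_id="AKIAIOSFODNN7EXAMPLE",
        secret_access_key="wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
        region="us-east-1",
    )
    print((Path(tmp) / ".aws" / "credentials").read_text())
```

Since the secret key sits in a plain text file, it's worth keeping that folder out of shared drives and version control.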
Congratulations! 🎉 You've successfully established an AWS account, implemented security measures, and are now ready to explore.
Conclusions
In this post, we explored cloud computing and Amazon Web Services, using the venue reservation analogy to illustrate the cloud's primary offering: renting servers without the burdens of ownership and maintenance. We discussed how cloud providers enable flexible access to resources through characteristics like on-demand self-service, broad network access, resource pooling, rapid elasticity, and measured service. We briefly covered AWS’s history and walked through setting up an account, an IAM group and user, and the CLI.
With these foundations established, we are ready to delve deeper into AWS, focusing on its core components: compute and storage. Future posts will cover these topics in detail, incorporating practical applications in Python and the CLI. Looking forward to seeing you there!
Best,
Matt