karasms.com

Unlocking the Power of AWS for Data Science: A Cloud Overview

Imagine you’re organizing networking events for data scientists in New York City. A significant part of your task involves securing a venue for your attendees. Previously, you’ve always rented locations throughout the city, but now you’re considering whether purchasing a venue might eliminate the repetitive search for availability.

In this analogy, the venue represents a computer, and the cloud computing sector would emphatically advise against buying. Instead, they argue that renting computers located in a data center is far more economical, secure, and scalable than ownership. They might suggest, “Allow us to handle the venue arrangements for you. In fact, why not focus solely on creating a fantastic event?”

The idea of shared computer resources has been around since the 1960s, with the phrase “cloud computing” emerging in the 1990s. It wasn’t until the 2000s that this industry began its rapid expansion, evolving into the significant force we see today. For instance, the global cloud computing market was valued at an impressive $445.3 billion in 2021 and is predicted to exceed $947.3 billion in the next five years!

Why is cloud computing so prevalent? And how can data scientists, machine learning engineers, or software developers take advantage of it? These questions deserve more space than a single article can provide. This post introduces the concept of cloud computing and why it matters, while subsequent articles will cover its two primary components: storage and compute.

We’ll concentrate on Amazon Web Services (AWS), the leading entity in the cloud space, which has significantly influenced the cloud evolution. However, the principles discussed here are equally relevant to other major providers, including Google Cloud Platform (GCP), Microsoft Azure, and Alibaba Cloud. Let’s dive in!

What Does "The Cloud" Mean?

In simple terms, the cloud consists of numerous computers housed in various data centers. A data center is a sophisticated, secure facility designed to host a multitude of operational computers, typically situated in locations where electricity costs are low or where temperatures are cooler.

These machines operate without monitors or keyboards; they are the bare hardware that executes calculations, stores and retrieves information, and responds to HTTP requests. We call them servers because they serve requests from users, which distinguishes them from the laptops and desktops people work on directly.

The data centers owned by major tech companies—such as Amazon, Meta, and Google—feature millions of servers stacked in rows, extending as far as the eye can see. Whenever someone uploads a photo of a cat to Instagram, saves a pair of shoes on Pinterest, or replies to a message on WhatsApp, they interact with several of these servers.

Why Utilize the Cloud?

You may already appreciate the advantages of cloud storage if you have ever used services like iCloud, Dropbox, or Google Drive. You can retrieve lost texts if your phone is misplaced; share files through links instead of cumbersome email attachments; and categorize and search your photos by the people featured in them.

However, the cloud’s benefits extend beyond personal convenience; it can also enhance your professional life. Not all of Amazon's servers are dedicated to processing shoe search queries or determining which users should see ads for Elden Ring. Many of these servers are available for you to rent.

The primary advantage of the cloud is the ability to rent servers without the need for ownership and maintenance. If the cloud industry had a motto, it might be “Whatever resources you require, whenever you need them.” If servers are akin to droplets in a physical cloud, this industry prides itself on allowing users to shape that cloud to fit their specific requirements.

For example, if you are launching a dating app, you will need a method to store user images, as well as a mechanism to train recommendation systems for matching users. You can easily rent a storage-optimized server for the images, another server optimized for running calculations for ranking models, and perhaps a smaller one for hosting the website and managing user authentication.

As your app’s audience grows around Valentine’s Day, you can scale these three servers to four, five, or more, and then reduce them back to three as user activity declines. You can even eliminate the calculations server entirely if you decide to match users randomly.
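The scale-up-and-scale-down logic above can be sketched in a few lines of Python. This is only an illustration of the idea, not how AWS autoscaling is actually configured: the function name `servers_needed`, the per-server capacity, and the traffic numbers are all assumptions invented for the example.

```python
import math

def servers_needed(requests_per_second: float,
                   capacity_per_server: float = 100.0,
                   minimum: int = 3) -> int:
    """Return how many servers to run for the given load.

    capacity_per_server and minimum are hypothetical values chosen for
    illustration; a real app would measure its own capacity.
    """
    if requests_per_second < 0:
        raise ValueError("load cannot be negative")
    required = math.ceil(requests_per_second / capacity_per_server)
    # Never drop below the baseline fleet of three servers.
    return max(minimum, required)

# Hypothetical peak traffic (requests per second) around Valentine's Day.
daily_peak_load = {"Feb 10": 250, "Feb 13": 420, "Feb 14": 510, "Feb 17": 180}

for day, load in daily_peak_load.items():
    print(f"{day}: {servers_needed(load)} servers")
```

The fleet grows as traffic climbs and falls back to the three-server baseline once the rush is over.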

Throughout this entire process, you only pay for what you consume. You also avoid the upfront expense of purchasing hardware, and you won’t be responsible if a user uploads so many photos that your storage server fails. Cloud providers build in substantial backups and redundancy, allowing clients to forget about the precise machines running their services.

Moreover, when I mention “servers,” it may seem as though you are renting the entire machine. However, to simplify cloud utilization, customers can reserve a fraction of a server, which still operates like a standalone machine. This virtualization benefits both users and cloud providers—you can reserve as little compute or storage as needed, and providers can maximize server utilization by sharing them across multiple clients.
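To make the sharing idea concrete, here is a toy model of several customers reserving slices of the same physical machines. It assumes a simple "first host with room" placement rule, and the `Host` class, the host names, and the core counts are all made up; real providers use far more sophisticated schedulers and hypervisors.

```python
from dataclasses import dataclass, field

@dataclass
class Host:
    """A physical server with a fixed number of CPU cores (hypothetical)."""
    name: str
    total_cores: int
    tenants: dict = field(default_factory=dict)  # customer -> cores reserved

    @property
    def free_cores(self) -> int:
        return self.total_cores - sum(self.tenants.values())

def place(hosts: list, customer: str, cores: int) -> str:
    """Reserve `cores` on the first host with room and return that host's name."""
    for host in hosts:
        if host.free_cores >= cores:
            host.tenants[customer] = host.tenants.get(customer, 0) + cores
            return host.name
    raise RuntimeError("no host has enough free capacity")

hosts = [Host("host-a", 16), Host("host-b", 16)]
placements = {
    "dating-app": place(hosts, "dating-app", 4),
    "blog": place(hosts, "blog", 2),
    "ml-startup": place(hosts, "ml-startup", 12),
}
print(placements)
print({host.name: host.free_cores for host in hosts})
```

Two small customers end up sharing one machine, while the larger request lands on the second machine. Each customer only sees "their" cores.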

Formal Definition of the Cloud

The National Institute of Standards and Technology (NIST) offers a formal definition of the cloud, outlining five key characteristics:

  1. On-demand self-service

    Users can select the computing resources they require without needing direct interaction with the service provider. For instance, a user can click a button to reserve a server for their website and another to release it.

  2. Broad network access

    These resources are accessible over the internet and can be utilized through various platforms. For example, a user can reserve a server from their laptop and later monitor its status using their phone.

  3. Resource pooling

    Cloud resources are dynamically allocated and reassigned, with the specific location of these resources being abstracted from the user. For instance, a user can reserve two servers without needing to know their exact locations in different data centers.

  4. Rapid elasticity

    Resources can be quickly provisioned and released to match demand. A user can automatically employ more servers when traffic to their application surges.

  5. Measured service

    Resource usage is meticulously monitored, visible, and manageable. For example, a user can track in real-time the costs associated with the servers hosting their application and adjust their allocation as business needs evolve.
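Measured service is what makes pay-as-you-go billing possible. As a rough sketch, a bill is just metered hours multiplied by an hourly price. The prices and usage figures below are invented for illustration and do not reflect real AWS pricing.

```python
from decimal import Decimal

# Hypothetical hourly prices in US dollars; real prices vary by provider,
# region, and server type.
HOURLY_PRICE = {
    "storage": Decimal("0.10"),
    "compute": Decimal("0.40"),
    "web": Decimal("0.02"),
}

def bill(usage_hours: dict) -> Decimal:
    """Sum the cost of each server type: hours used times hourly price."""
    return sum(
        (HOURLY_PRICE[kind] * Decimal(hours) for kind, hours in usage_hours.items()),
        Decimal("0"),
    )

# A 720-hour month of storage and web hosting, but compute only while training.
usage = {"storage": 720, "compute": 50, "web": 720}
print(f"Monthly bill: ${bill(usage)}")
```

Because the compute server is metered only for the hours it actually runs, shutting it down between training jobs directly lowers the bill.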

Returning to the venue analogy, a “Venue AWS” service would claim that you can automatically secure a venue of any size through their website or app (points 1 and 2). These venues would come from a large pool that appears and disappears based on user events (point 3). If your expected number of attendees fluctuates, you can dynamically adjust the venue size as many times as needed (point 4). Lastly, you can see exactly what you’re paying for and opt out if you change your mind at any time (point 5).

When Might the Cloud Not Be Suitable?

Before proceeding, it’s important to address the downsides of cloud computing. Despite their best efforts, cloud providers do occasionally experience failures, resulting in service interruptions for their users. In November 2020, for example, companies like Adobe, iRobot, and Roku suffered outages due to a misconfiguration in the addition of new servers to an AWS data center. Additionally, cloud providers are not immune to data breaches, as evidenced by incidents such as the leak of Twitch’s source code or data exposure by a disgruntled employee.

If you’re working in a field where user information is highly sensitive (e.g., social security numbers) or your application cannot afford any downtime (e.g., emergency services), it may be prudent to invest in a system you can fully control.

Overview of AWS

Amazon Web Services (AWS) is Amazon’s cloud computing platform. Launched in 2002, AWS was part of Amazon’s effort to create a more service-oriented architecture for its software engineers. The ambitious goal was to enhance team autonomy, adopt REST APIs, standardize infrastructure, eliminate gatekeeping, and continuously deploy code. To realize this vision, Amazon recognized the necessity for a distributed, scalable software architecture.

The initiative was so successful that Amazon later transformed AWS into a publicly available product. Currently, AWS commands approximately 32% of the cloud computing market, with Microsoft Azure and Google Cloud holding 20% and 9% respectively. AWS offers over 200 services, providing various levels of abstraction. You can choose to delve deep into the operating system level and refine the server foundations for your application or simply launch your app using a service that manages most of the details for you.

Getting Started

Let’s create an AWS account. The subsequent articles will be more comprehensible if you can follow along and later experiment on your own.

Start by visiting the AWS website and clicking “Create an AWS Account.” AWS offers a free tier that covers limited usage of many of its services during the first year, enabling you to learn and explore without financial concerns (though you must attach a credit card in case you decide to venture beyond the introductory materials).

Assuming you’re not a bot, the steps should be straightforward to complete. Once your account is created, you’ll see the console home with options such as recently visited applications, “Welcome to AWS” links, cost and usage details, and more.

IAM (Identity and Access Management)

Congratulations! Should we begin building an image classifier or a chatbot? Perhaps a video game engine or a satellite controller? Actually, let’s start with identity management.

While it may seem anticlimactic to kick off our AWS journey with identity management, adhering to security best practices is crucial for effectively utilizing the cloud. Your app’s users are unlikely to be forgiving if you fall victim to a hack—especially if their data is compromised! Moreover, an attacker could accumulate significant expenses before you manage to cancel your credit card. On a more positive note, setting up identity management allows you to smoothly and securely onboard new developers to your project.

When you created your AWS account, you established a root user, an all-powerful account that can perform any action on any service, create and delete other user profiles, access and modify payment information, and close the account. This level of power is concentrated in one place. In fact, the very first recommendation from AWS for new users is to set up multi-factor authentication (MFA) to make it more challenging (though not impossible) for attackers to access the root account.

If you navigate to IAM, AWS’s Identity and Access Management service (search for “IAM” in the search bar), you will see a prominent warning that the root user is unprotected.

Let’s address this vulnerability—click on “Add MFA,” choose your preferred method (like an authenticator app), and follow the instructions. Once MFA is set up, anyone attempting to log in as the root user will need to provide additional verification. This added step is worthwhile; you typically won’t need to access the root account regularly.

Instead, you’ll generally log into a user profile with permissions tailored to your usual tasks. For instance, your day-to-day responsibilities likely don’t involve updating credit card information or modifying customer passwords, so you can restrict these actions for a user profile (including yourself unless you log in as the root user).

In fact, IAM profiles cannot perform any action unless they are explicitly granted permission. This default makes it natural to follow the security practice known as least privilege: give each user only the permissions their tasks require. Doing so significantly limits the damage a compromised account could cause. For example, one user (call them the blue user) may only access S3 and DynamoDB, while another (the orange user) may access S3, Lambda, API Gateway, and CloudWatch. Moreover, even within a service like S3, the actions and content available to the blue user can differ from those available to the orange user, depending on their specific business requirements.
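Permissions like these are written as JSON policy documents. Below is a sketch of what a policy for a user limited to S3 and DynamoDB might look like, followed by a deliberately simplified checker that shows the "denied unless allowed" behavior. The bucket name is hypothetical, and `is_allowed` is a teaching aid of my own: it ignores resources, conditions, and explicit Deny statements, so it is not AWS's real evaluation logic.

```python
import json
from fnmatch import fnmatchcase

# A policy document in IAM's JSON format. The bucket name is hypothetical.
blue_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [
                "arn:aws:s3:::example-dating-app-photos",
                "arn:aws:s3:::example-dating-app-photos/*",
            ],
        },
        {"Effect": "Allow", "Action": "dynamodb:*", "Resource": "*"},
    ],
}

def is_allowed(policy: dict, action: str) -> bool:
    """Simplified check: allowed only if some Allow statement matches the action."""
    for statement in policy["Statement"]:
        actions = statement["Action"]
        if isinstance(actions, str):
            actions = [actions]
        if statement["Effect"] == "Allow" and any(
            fnmatchcase(action, pattern) for pattern in actions
        ):
            return True
    return False  # nothing matched, so the request is denied by default

print(json.dumps(blue_policy, indent=2))
print(is_allowed(blue_policy, "s3:GetObject"))
print(is_allowed(blue_policy, "lambda:InvokeFunction"))
```

Note that the policy grants read access to S3 but not `s3:DeleteObject`, which is the kind of within-service difference described above.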

Now, let’s set up an IAM profile. We’ll start by creating a user group with certain permissions that will automatically apply to any user added to that group. This simplifies onboarding new users, as we can simply add their profile to a preconfigured group with the necessary permissions.

To create a group, navigate to IAM, click on “User groups” on the left, and then click “Create group” on the right. We can name this group admins, scroll down, search for the AdministratorAccess policy, and attach it to our group.

Next, we’ll create our user. Click on “Users” on the left, select “Add users,” and enter a name for the account. We will choose both “Programmatic access” (to access AWS via code) and “AWS Management Console access” (to sign in as our user instead of the root user).

Proceed to “Next: Permissions” and add the user to our admins group. “Next: Tags” allows for adding tags for search purposes (useful if you have many users), but we can skip that for now. In the review screen, confirm that everything appears correct, and click “Create user.”

The next screen is crucial! The auto-generated password and secret access key (for accessing AWS via code) are provided only once. Before exiting the page, ensure you write down your access key ID, secret access key, and auto-generated password in a secure location—especially the access key ID and secret access key, as they are all that’s needed for anyone on the internet to access your AWS services through a Python script.
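Because the key pair is so powerful, it should never be pasted directly into a script. One common approach is to read it from environment variables. The sketch below assumes that approach: `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` are the variable names the AWS tools look for, while `load_aws_credentials` and `mask` are hypothetical helpers written for this example. The key values are the placeholder credentials used in AWS's documentation, not real ones.

```python
import os

def load_aws_credentials(env=os.environ) -> dict:
    """Read the access key pair from environment variables instead of source code."""
    try:
        return {
            "access_key_id": env["AWS_ACCESS_KEY_ID"],
            "secret_access_key": env["AWS_SECRET_ACCESS_KEY"],
        }
    except KeyError as missing:
        raise RuntimeError(f"environment variable {missing} is not set") from None

def mask(secret: str, visible: int = 4) -> str:
    """Hide all but the last few characters so a secret is safer to print."""
    hidden = max(len(secret) - visible, 0)
    return "*" * hidden + secret[hidden:]

# Placeholder values from the AWS documentation, not real credentials.
example_env = {
    "AWS_ACCESS_KEY_ID": "AKIAIOSFODNN7EXAMPLE",
    "AWS_SECRET_ACCESS_KEY": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
}
creds = load_aws_credentials(example_env)
print(creds["access_key_id"], mask(creds["secret_access_key"]))
```

Keeping the keys out of the code means the script can be shared or committed to version control without leaking them.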

Excellent! One last point: if we log in as an IAM user, we’ll need to provide our AWS account’s 12-digit account ID. This number can be cumbersome to remember unless you use a password manager or have a great memory for numbers. Instead, let’s create an alias for the account (essentially a username) so that logging in is easier. Unlike our IAM profile name, this username must be unique across all of AWS, so you might need something more specific than matt.
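The alias also appears in the sign-in link for IAM users, which makes it easy to bookmark or share with teammates. Here is a small sketch that builds that link from either an account ID or an alias. The function name and the example alias are hypothetical, and the alias check is a simplification of the real naming rules, so treat it as an assumption rather than the official validation.

```python
import re

def console_sign_in_url(account: str) -> str:
    """Build the IAM user sign-in URL from a 12-digit account ID or an alias."""
    is_account_id = re.fullmatch(r"\d{12}", account) is not None
    # Simplified alias rule (lowercase letters, digits, inner hyphens);
    # check the IAM documentation for the exact constraints.
    is_alias = re.fullmatch(r"[a-z0-9](?:[a-z0-9-]*[a-z0-9])?", account) is not None
    if not (is_account_id or is_alias):
        raise ValueError(f"not a valid account ID or alias: {account!r}")
    return f"https://{account}.signin.aws.amazon.com/console"

print(console_sign_in_url("123456789012"))
print(console_sign_in_url("matt-dating-app"))
```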

CLI (Command Line Interface)

Finally, install the AWS CLI to access your AWS services via the command line. (Python users can simply run pip install awscli.) While the user interface generally suffices for our needs, there are times when a command-line interface is invaluable—for instance, uploading 10,000 CSV files into an S3 bucket can be accomplished with a few keystrokes rather than through a series of clicks in the UI. The CLI also facilitates automation of AWS actions, such as scaling server resources via scripts.
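As a taste of the CSV example, the sketch below assembles an `aws s3 cp` command from Python. The `--recursive`, `--exclude`, and `--include` options belong to the CLI itself, whereas the directory, bucket name, and prefix are hypothetical. The code only prints the command; actually running it requires the CLI, configured credentials, and a bucket you own.

```python
import shlex

def s3_upload_csv_command(local_dir: str, bucket: str, prefix: str = "") -> list:
    """Build an `aws s3 cp` command that uploads only the CSV files in local_dir."""
    destination = f"s3://{bucket}/{prefix}"
    return [
        "aws", "s3", "cp", local_dir, destination,
        "--recursive",          # walk the whole directory
        "--exclude", "*",       # start by excluding everything...
        "--include", "*.csv",   # ...then add back the CSV files
    ]

command = s3_upload_csv_command("./data", "example-bucket", "raw/")
print(shlex.join(command))

# To actually run it (requires the CLI and configured credentials):
# import subprocess
# subprocess.run(command, check=True)
```

Building the command as a list rather than a single string avoids shell-quoting problems when a script passes it to `subprocess.run`.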

Once the CLI is installed, execute aws configure to input your access key ID, secret access key, and other relevant details.
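Behind the scenes, `aws configure` saves your answers as plain INI files named `credentials` and `config` in a `.aws` folder in your home directory. The sketch below recreates that layout in a temporary directory so you can see what the files look like without touching your real settings. The `write_profile` helper is hypothetical, the key values are AWS's documentation placeholders, and the region is just an example.

```python
import configparser
import tempfile
from pathlib import Path

def write_profile(aws_dir: Path, access_key_id: str, secret_access_key: str,
                  region: str, output: str = "json") -> None:
    """Write `credentials` and `config` files in the INI layout the CLI uses."""
    aws_dir.mkdir(parents=True, exist_ok=True)

    # interpolation=None so a "%" in a secret key is stored literally.
    credentials = configparser.ConfigParser(interpolation=None)
    credentials["default"] = {
        "aws_access_key_id": access_key_id,
        "aws_secret_access_key": secret_access_key,
    }
    with open(aws_dir / "credentials", "w") as f:
        credentials.write(f)

    config = configparser.ConfigParser(interpolation=None)
    config["default"] = {"region": region, "output": output}
    with open(aws_dir / "config", "w") as f:
        config.write(f)

# Use a throwaway directory so the real ~/.aws folder is never modified.
with tempfile.TemporaryDirectory() as tmp:
    aws_dir = Path(tmp) / ".aws"
    write_profile(aws_dir, "AKIAIOSFODNN7EXAMPLE",
                  "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY", "us-east-1")
    credentials_text = (aws_dir / "credentials").read_text()
    saved = configparser.ConfigParser(interpolation=None)
    saved.read(aws_dir / "config")
    saved_region = saved["default"]["region"]

print(credentials_text)
print("region:", saved_region)
```

Knowing where these files live is handy when you need to rotate a key or check which region your commands run in.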

Congratulations! You’ve successfully set up an AWS account, followed security best practices, and are now ready to explore.

Conclusion

In this article, we introduced the fundamentals of cloud computing and Amazon Web Services. We employed the analogy of venue reservations to illustrate the cloud's primary offering: the ability to rent servers (computers) without the need for ownership and upkeep. We also discussed how cloud providers facilitate flexible and dynamic server access through on-demand self-service, broad network access, resource pooling, rapid elasticity, and measured service.

Additionally, we briefly covered the history of AWS and provided instructions for setting up an account, an IAM profile, and the CLI.

We’re now prepared to delve deeper into AWS and grasp its two foundational elements: compute and storage. Each will be the focus of an upcoming post, where we will explore the essentials of these offerings while engaging with Python and the CLI. Looking forward to seeing you there!

Best, Matt

Footnotes

1. What is the cloud?

Technically, your HTTP request to see Justin Bieber’s latest tweet likely does not reach one of these massive data centers directly. Instead, it probably connects with the nearest content delivery network (CDN) node (also known as a point of presence, PoP), one of thousands of smaller data centers distributed globally.

CDNs ease the burden on data center databases (and the internet in general) by caching frequently accessed content. It’s significantly faster to pull data from a cache than from a disk, allowing the server to promptly return that Bieber tweet instead of searching through trillions of tweets. In fact, Netflix can provide such a seamless streaming experience because they utilize extensive CDNs.
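The caching idea can be illustrated with a tiny cache that sits in front of a slow lookup. This sketch assumes a least-recently-used eviction rule, which is a common textbook choice and not necessarily what any real CDN does. The `EdgeCache` class and `fetch_from_origin` function are hypothetical stand-ins.

```python
from collections import OrderedDict

class EdgeCache:
    """A tiny least-recently-used cache standing in for a CDN node."""

    def __init__(self, origin_fetch, capacity: int = 2):
        self.origin_fetch = origin_fetch    # slow call to the main data center
        self.capacity = capacity
        self.items = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self.items:
            self.hits += 1
            self.items.move_to_end(key)     # mark as most recently used
            return self.items[key]
        self.misses += 1
        value = self.origin_fetch(key)
        self.items[key] = value
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)  # evict the least recently used
        return value

def fetch_from_origin(tweet_id):
    """Hypothetical stand-in for a database lookup in the main data center."""
    return f"tweet #{tweet_id}"

cache = EdgeCache(fetch_from_origin, capacity=2)
for tweet_id in [1, 1, 2, 1, 3, 2]:
    cache.get(tweet_id)
print(f"hits={cache.hits} misses={cache.misses}")
```

Every hit is a request the main data center never has to answer, and the small capacity shows why a cache cannot hold everything.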

However, CDNs have limitations; otherwise, they would be the sole option. The hardware used for caches can be expensive—1 TB in AWS costs $23 on S3 and $85 on CloudFront, for instance. Moreover, there is plenty of content that we cannot store in a cache, such as user data that requires secure login access.
