AWS S3, which stands for Simple Storage Service, is the object storage service provided by AWS; almost every other AWS service uses S3 in one way or another.
- S3 stores objects; an object in this context means the data itself plus the metadata related to it.
- Maximum object size you can store on S3 is 5TB
- The data persisted on S3 is automatically replicated within the same region.
- S3 is eventually consistent, which means you might get stale data: if you update an existing object and immediately retrieve it you may get the old version, and if you delete an object an immediate retrieve might still return the deleted object.
- You can use ACL (Access Control List) which is the legacy access control mechanism for S3
- If you have a large number of objects and want higher request rates, consider some randomisation for your keys, e.g. adding a hash prefix to them.
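As a sketch of the hash-prefix idea above, a short hex digest of the key can be prepended so that keys spread across S3's index partitions (`randomized_key` is a hypothetical helper, not an AWS API):

```python
import hashlib

def randomized_key(original_key: str, prefix_len: int = 4) -> str:
    """Prepend a short hex hash of the key so objects spread across
    partitions instead of clustering under one sequential prefix."""
    digest = hashlib.md5(original_key.encode("utf-8")).hexdigest()
    return f"{digest[:prefix_len]}/{original_key}"

# e.g. "2016/01/01/photo.jpg" -> "<4 hex chars>/2016/01/01/photo.jpg"
print(randomized_key("2016/01/01/photo.jpg"))
```

The hash is deterministic, so the randomized key can always be recomputed from the original key when reading the object back.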
- S3 has 11-nines durability and 4-nines availability.
- There are 2 types of metadata: System & User
- The key of an object can be at most 1024 bytes of UTF-8 characters
- All S3 objects are private by default
- Objects are stored in containers called buckets
- Each AWS account can have up to 100 buckets
- Each bucket can hold unlimited amount of data
- Bucket names are globally unique
- Bucket names must be at least 3 and at most 63 characters
- Bucket names can only contain lowercase letters, numbers, hyphens and periods, and must start with a lowercase letter or a number
- Versioning happens at bucket level.
- Once versioning is enabled it cannot be removed but can only be suspended.
- MFA (Multi Factor Authentication) Delete gives extra protection for data against both intruders and accidental deletion
- Only root account can enable MFA Delete
You can create pre-signed URLs for time limited access to objects. To create a pre-signed URL you need:
- Security Credentials
- Bucket Name
- Object Key
- HTTP Method
- Expiration date & time
S3 Storage Classes:
There are 4 storage classes for S3
- Standard : High performance and low latency
- Standard-IA (Infrequent Access) min object size is 128KB & min duration is 30 days
- RRS (Reduced Redundancy Storage) – durability is 4-nines: For easily re-generatable data like thumbnails
- Glacier: For archiving data where retrieval takes 3 to 5 hours. (5% of the data currently stored in Glacier can be restored for free each month)
The lifecycle of stored data can be automated to decrease cost. For example, data on S3 Standard can automatically be moved to S3 Standard-IA after 30 days, then to Amazon Glacier after 90 days, and then deleted from Glacier after 3 years.
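The lifecycle example above could be expressed as a rule like the following; the shape follows S3's lifecycle configuration API, but the rule ID is a made-up placeholder:

```python
# Standard -> Standard-IA at 30 days, -> Glacier at 90 days,
# then delete after 3 years (1095 days).
lifecycle_configuration = {
    "Rules": [
        {
            "ID": "archive-then-expire",   # hypothetical rule name
            "Status": "Enabled",
            "Filter": {"Prefix": ""},      # empty prefix = every object
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},
                {"Days": 90, "StorageClass": "GLACIER"},
            ],
            "Expiration": {"Days": 1095},
        }
    ]
}
```

This dict is what you would pass to a put-bucket-lifecycle call; the transitions and expiration are all counted from object creation.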
Since you might be uploading very large data, it makes sense to split it into parts, and S3 supports multipart upload. Parts can be uploaded in arbitrary order, with retransmission if needed, and S3 assembles the object after all parts are uploaded. You should use multipart upload for data larger than 100MB and you must use it for data larger than 5GB.
Multipart upload works in 3 stages:
- Initiation
- Uploading parts
- Completion (or abort)
If you use an AWS SDK, multipart upload is taken care of by the SDK automatically.
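The part bookkeeping an SDK does for you can be sketched as a simple split of the total size into byte offsets (`plan_parts` is a hypothetical helper, and 100MB is used as the part size per the guidance above):

```python
MB = 1024 * 1024

def plan_parts(total_size: int, part_size: int = 100 * MB):
    """Split an upload into (start, end) byte offsets, one per part.
    Each part can then be uploaded independently and in any order."""
    parts = []
    start = 0
    while start < total_size:
        end = min(start + part_size, total_size)
        parts.append((start, end))
        start = end
    return parts

# A 250 MB upload becomes three parts: 100 MB, 100 MB, 50 MB.
print(plan_parts(250 * MB))
```

Because each part is independent, a failed part can be retransmitted on its own instead of restarting the whole upload.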
You can download a portion of an object by specifying a byte range. This is supported by both S3 and Glacier.
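A ranged download is expressed with a standard HTTP `Range` header, which a GET request (or an SDK's range parameter) carries; the small helper below is hypothetical:

```python
def range_header(first_byte: int, last_byte: int) -> dict:
    """Build the HTTP Range header for a partial GET.
    Both bounds are inclusive, per the HTTP spec."""
    return {"Range": f"bytes={first_byte}-{last_byte}"}

# Fetch only the first kilobyte of an object:
print(range_header(0, 1023))
```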
Cross Region Replication:
- You can setup cross region replication for your S3 buckets.
- The replication is an asynchronous operation.
- You need to define source and target buckets.
- The replication is for everything including metadata and ACLs.
- For this to work, you need to turn versioning on for both buckets.
- If cross region replication is turned on for an existing bucket, only new objects will be replicated. Objects that already existed must be copied manually.
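A replication setup along the lines described above could look like the following dict; the shape follows S3's replication configuration API, while the IAM role ARN and bucket names are placeholders (and versioning must already be enabled on both buckets):

```python
# Replicate all new objects from the source bucket to the target bucket.
replication_configuration = {
    "Role": "arn:aws:iam::123456789012:role/replication-role",  # placeholder
    "Rules": [
        {
            "ID": "replicate-everything",   # hypothetical rule name
            "Status": "Enabled",
            "Prefix": "",                   # empty prefix = all objects
            "Destination": {
                "Bucket": "arn:aws:s3:::my-target-bucket",
            },
        }
    ],
}
```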
Server access logs are off by default but can be enabled. Logs can go into the bucket itself or into another bucket.
- Event notifications are at bucket level.
- You can use a prefix or suffix to create notifications.
- Notifications can go to:
- SNS (Simple Notification Service)
- SQS (Simple Queue Service)
- AWS Lambda
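As an illustration of prefix/suffix filtering, a notification configuration that sends object-created events to SQS might look like this; the shape follows S3's notification configuration API, and the queue ARN, prefix, and suffix are placeholder values:

```python
# Notify an SQS queue whenever a .jpg lands under the images/ prefix.
notification_configuration = {
    "QueueConfigurations": [
        {
            "QueueArn": "arn:aws:sqs:us-east-1:123456789012:uploads-queue",
            "Events": ["s3:ObjectCreated:*"],
            "Filter": {
                "Key": {
                    "FilterRules": [
                        {"Name": "prefix", "Value": "images/"},
                        {"Name": "suffix", "Value": ".jpg"},
                    ]
                }
            },
        }
    ]
}
```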
Encryption (at rest):
Amazon KMS uses 256-bit AES (Advanced Encryption Standard)
There are 3 different SSE (Server Side Encryption) options provided by AWS:
- SSE-S3 (S3 Managed Keys)
  - Key management and protection is done by AWS
  - Each object is encrypted with a unique key
  - Each key is further encrypted with a master key
  - The master key is rotated at least monthly, automatically, by AWS
  - Data, object keys and the master key are all stored separately for further protection
- SSE-KMS (KMS Managed Keys)
  - Key management and protection is done by AWS
  - You can manage the keys with AWS KMS
  - AWS KMS also provides detailed audit information
- SSE-C (Customer Provided Keys)
  - All encryption and decryption is still done by AWS
  - You maintain full control of your keys
SSE-S3 and SSE-KMS are the simplest options to use for server side encryption.
- Glacier is for infrequently accessed data.
- Retrieval is between 3 to 5 hours.
- Glacier stores data in archives, which are contained in vaults.
- Each archive can hold up to 40TB of data
- You can have unlimited number of archives
- A unique id is assigned for each archive at creation time automatically
- Archives are automatically encrypted
- Archives are immutable: once created they cannot be changed.
- Vaults can be locked with a policy, e.g. to enforce WORM (write once read many)
- Once locked, the policy cannot be changed
- Each account can have a maximum of 1000 vaults
- Each vault can store unlimited archives
- 5% of data stored in glacier can be restored free of charge monthly.
- You can set data retrieval policy to make sure you are not going over your budget by mistake.
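A budget guard of that kind could be expressed like this; the shape follows Glacier's data retrieval policy API, and the 1 GB/hour cap is an arbitrary example value:

```python
# Cap retrievals at a fixed rate; "FreeTier" would instead keep
# retrievals inside the free 5%-per-month allowance.
retrieval_policy = {
    "Rules": [
        {"Strategy": "BytesPerHour", "BytesPerHour": 1024 ** 3}  # 1 GB/hour
    ]
}
```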
Glacier vs S3
- 40TB archives vs 5TB objects
- System generated Id vs user assigned Id
- Automatic encryption vs optional encryption at rest