Welcome to Skyplane!#

Note

This project is under active development.

đŸ”„ Blazing fast bulk data transfers between any cloud đŸ”„

pip install skyplane[aws]
skyplane init
skyplane [sync/cp] [local/s3/gs/azure]://mybucket/big_dataset [local/s3/gs/azure]://mybucket2/

Skyplane is a tool for blazingly fast bulk data transfers between object stores in the cloud. It provisions a fleet of VMs in the cloud to transfer data in parallel while using compression and bandwidth tiering to reduce cost.

Skyplane is:

  1. đŸ”„ Blazing fast (110x faster than AWS DataSync)

  2. đŸ€‘ Cheap (4x cheaper than rsync)

  3. 🌐 Universal (AWS, Azure and GCP)

You can use Skyplane to transfer data:

  • between object stores within a cloud provider (e.g. AWS us-east-1 to AWS us-west-2)

  • between object stores across multiple cloud providers (e.g. AWS us-east-1 to GCP us-central1)

  • between local storage and cloud object stores (experimental)

Skyplane currently supports the following source and destination endpoints (any source and destination can be combined):

Endpoint              Source   Destination
AWS S3                ✅        ✅
Google Storage        ✅        ✅
Azure Blob Storage    ✅        ✅
Local Disk            ✅        (in progress)

Skyplane is an actively developed project. It will have đŸ”Ș SHARP EDGES đŸ”Ș. Please file an issue or ask the contributors via the #help channel on our Slack if you encounter bugs.

Contents#

Installation#

We’re ready to install Skyplane. It’s as easy as:

---> Install skyplane from PyPI:
$ pip install "skyplane[aws]"

# install support for other clouds as needed:
#   $ pip install "skyplane[azure]"
#   $ pip install "skyplane[gcp]"
#   $ pip install "skyplane[all]"

Note

If you are using an M1 MacBook with the arm64 architecture, you will need to install skyplane as follows:

$ GRPC_PYTHON_BUILD_SYSTEM_OPENSSL=1 GRPC_PYTHON_BUILD_SYSTEM_ZLIB=1 pip install skyplane[all]

Setting up Cloud Credentials#

Skyplane needs access to cloud credentials to perform transfers. To get started with setting up credentials, make sure you have cloud provider CLI tools installed:

---> For AWS:
$ pip install awscli

---> For Google Cloud (installs the gcloud CLI):
$ curl https://sdk.cloud.google.com | bash

---> For Azure:
$ pip install azure-cli

Once you have the CLI tools setup, log into each cloud provider’s CLI:

---> For AWS:
$ aws configure

---> For Google Cloud:
$ gcloud auth application-default login

---> For Azure:
$ az login

Now, you can initialize Skyplane with your desired cloud providers. Skyplane autodetects cloud credentials and valid regions from your CLI environment.

---> Setup cloud provider connectors:
$ skyplane init

Transferring Data via Skyplane CLI#

We’re ready to use the Skyplane CLI! Let’s use skyplane cp to copy files from AWS to GCP:

---> 🎾 Ready to rock and roll! Copy some files:
$ skyplane cp -r s3://... gs://...

To transfer only new objects, you can instead use skyplane sync:

---> Copy only diff
$ skyplane sync s3://... gs://...

To transfer from local disk or HDFS cluster, you can use skyplane cp as well:

(Note: On-prem transfers require additional setup. Please see the On-Prem Transfers section for more details.)

---> Copy from local disk
$ skyplane cp -r /path/to/local/file gs://...

---> Copy from HDFS
$ skyplane cp -r hdfs://... gs://...

Quickstart#

CLI#

The simplest way to run transfers on Skyplane is to use the CLI. To transfer files from AWS to GCP, you can run:

skyplane cp -r s3://... gs://...

You can also sync directories to avoid copying data that is already in the destination location:

skyplane sync s3://... gs://...

Python API#

You can also run skyplane from a Python API client. To copy a single object or folder, you can run:

import skyplane

client = skyplane.SkyplaneClient()
client.copy(src="s3://bucket-src/key", dst="s3://bucket-dst/key", recursive=False)

This will create a skyplane dataplane (i.e. cluster), execute the transfer, and tear down the cluster upon completion.

You can also execute multiple transfers on the same dataplane to reduce overhead from VM startup time. To do this, you can define a dataplane object and provision it:

dp = client.dataplane("aws", "us-east-1", "aws", "us-east-2", n_vms=8)
dp.provision()

This will create a dataplane for transfers between us-east-1 and us-east-2 with 8 VMs per region. Now, we can queue transfer jobs in this dataplane:

# queue transfer 
dp.queue_copy("s3://bucket1/key1", "s3://bucket2/key1")
dp.queue_copy("s3://bucket1/key2", "s3://bucket2/key2")

# execute transfer
tracker = dp.run_async()

# monitor transfer status
remaining_bytes = tracker.query_bytes_remaining()

The queued transfer won’t run until you call dp.run() or dp.run_async(). Once you run the transfer, you can monitor it with the returned tracker object. Once the transfer is complete, make sure to deprovision the dataplane to avoid cloud costs:

# tear down the dataplane 
dp.deprovision() 

You can have Skyplane automatically deprovision the dataplane with dp.auto_deprovision():

with dp.auto_deprovision():
    dp.provision()
    dp.queue_copy(...)
    tracker = dp.run_async()

Now you can programmatically transfer terabytes of data across clouds! To see some examples of applications you can build with the API, you can check out our tutorials on how to load training data from another region and build an Airflow operator.

Configuration#

Skyplane comes with a variety of knobs to tune to adjust performance or change how VMs are provisioned. You can configure these options using the skyplane config set <key> <value> command.

Full list of transfer options

  • CLI configuration

    • autoconfirm: If set, it will not ask for you to confirm the transfers from the CLI. (default False)

    • autoshutdown_minutes: If set, VMs will automatically shut down after this time in minutes. (default 15)

    • usage_stats: If set, Skyplane will send aggregate performance statistics for a collective throughput grid. (default True)

  • Transfer parallelism

    • max_instances: Maximum number of instances to use for parallel transfers. (default 10)

  • Network configuration

    • bbr: If set, the VM will use BBR congestion control instead of CUBIC. (default True)

    • compress: If set, gateway VMs will compress data before egress to reduce costs. (default True)

    • encrypt_e2e: If set, gateway VMs will encrypt data end-to-end. (default True)

    • encrypt_socket_tls: If set, all sockets between gateways will be encrypted with TLS. (default False)

    • verify_checksums: If set, gateway VMs will compute checksums at the source and verify checksums at the destination. (default True)

    • num_connections: Number of connections to use between each gateway. (default 32)

  • Object store configuration

    • multipart_enabled: If set, multipart transfers will be enabled. (default False)

    • multipart_min_threshold_mb: Minimum threshold in MB for multipart transfers. Below this, the object will be transferred in one chunk. (default 64).

    • multipart_min_size_mb: Minimum size per chunk in MB for multipart transfers. (default 8).

    • multipart_max_chunks: Maximum number of chunks for multipart transfers. (default 9990).

    • requester_pays: If set, Skyplane will support requester pays buckets. (default False).

  • Fallback to native commands

    • native_cmd_enabled: If set, Skyplane will fallback to native commands if the transfer fails. (default True)

    • native_cmd_threshold_gb: Transfers smaller than this threshold will be delegated to native commands. (default 2)

  • Instance provisioning configuration

    • aws_instance_class: AWS instance class to use for provisioning. (default m5.8xlarge)

    • aws_use_spot_instances: If set, AWS will use spot instances instead of on-demand instances. (default False)

    • aws_default_region: AWS region to use for provisioning. (default us-east-1)

    • azure_instance_class: Azure instance class to use for provisioning. (default Standard_D32_v4)

    • azure_use_spot_instances: If set, Azure will use spot instances instead of on-demand instances. (default False)

    • azure_default_region: Azure region to use for provisioning. (default eastus)

    • gcp_instance_class: GCP instance class to use for provisioning. (default n2-standard-32)

    • gcp_use_premium_network: If set, will provision VMs on GCP’s premium network tier. (default True)

    • gcp_service_account_name: GCP service account name to use for provisioning. (default skyplane-manual)

    • gcp_use_spot_instances: If set, GCP will use spot instances instead of on-demand instances. (default False)

    • gcp_default_region: GCP region to use for provisioning. (default us-central1-a)
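For example, a few of these options can be set from the command line; the values below are illustrative, not recommendations:

---> Tune parallelism, connections, and multipart behavior
$ skyplane config set max_instances 4
$ skyplane config set num_connections 64
$ skyplane config set multipart_enabled true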

Increasing performance of transfers via parallelism#

Skyplane transfers files in parallel across a set of gateway VMs. The total bandwidth of a transfer is mostly determined by the number of VMs used. Skyplane will automatically partition and distribute the transfer across the set of gateway VMs.

To increase the parallelism of a transfer, run:

$ skyplane config set max_instances <number>

By default, Skyplane will use a maximum of 1 VM in each region. This limit is conservative since too many VMs can result in failed transfers due to vCPU limits and other resource constraints.

If you encounter vCPU limits, you can raise them by following this guide:

Increasing cloud vCPU limits#

Skyplane utilizes parallel VMs to transfer data at high rates. However, if you do not have sufficient vCPU limits to support the number of VMs you need, you can increase the number of VMs you can use by requesting a quota increase from each respective cloud.

Increasing AWS quotas#

To increase your AWS quota, follow these steps:

  1. Go to the AWS EC2 console

  2. Select the region you want to increase your quota for from the top right corner of the page.

  3. Select the Quotas tab.

  4. Search for “Running On-Demand Standard (A, C, D, H, I, M, R, T, Z) instances” and select the radio button on the left.

  5. Click “Request quota increase”.

  6. In the form, request a higher vCPU allocation.

    • By default, each Skyplane VM uses 32 vCPUs and provides up to 5Gbps of network throughput on AWS.

    • Example: If you’d like to use 8 VMs per region, request 256 vCPUs.

  7. Click “Request”. AWS can take up to a few days to review the request.

Increasing Azure quotas#

To increase your Azure quota, follow these steps:

  1. Go to the Azure Quota request page in the Azure portal.

  2. Filter the quotas by your subscription:

    • Under Search, enter “Standard Dv4 Family vCPUs”.

    • Under the service dropdown, select “Compute”.

    • Under the subscription dropdown, select your Azure subscription.

    • Under region, select the regions you want to increase your quota for.

  3. Checkmark all the quotas you want to increase.

  4. Click “Request quota increase” and select “Enter a new limit”.

  5. Enter the number of vCPUs you want to increase your quota for.

    • By default, each Skyplane VM uses 32 vCPUs and provides up to 12.5Gbps on Azure.

    • Example: If you’d like to use 8 VMs per region, enter 256 vCPUs.

  6. Click “Submit”. Azure can take up to a few days to review the request.

Increasing GCP quotas#

To increase your GCP quotas, follow these steps:

  1. Go to the GCP Console.

  2. Select your project from the top dropdown.

  3. Search for “All quotas” and navigate to the All quotas page.

  4. Filter the quota list:

    • Under “Service”, select “Compute Engine API”.

    • Under “Quota”, select “N2 CPUs”

  5. Select all regions you want to increase your quota for.

  6. Click “Edit quotas”.

  7. Increase the number of vCPUs you want to increase your quota for.

    • By default, each Skyplane VM uses 32 vCPUs and provides up to 7Gbps of network throughput on GCP.

    • Example: If you’d like to use 8 VMs per region, request 256 vCPUs.

  8. Click “Submit request”. GCP can take up to a few days to review the request. In some cases, limit increases are approved immediately via automated review.

Reusing Gateways#

It can take 45s to 2m to provision gateway VMs for a transfer. If you are repeatedly transferring data between the same pair of regions, you can reuse gateway VMs. By default, Skyplane terminates these VMs to avoid unnecessary VM costs.

When running a cp or sync command, pass the --reuse-gateways flag to Skyplane to reuse gateway VMs:

$ skyplane cp --reuse-gateways <source> <destination>
$ skyplane sync --reuse-gateways <source> <destination>

We will attempt to automatically deprovision these gateways after 15 minutes by default. Change this interval via skyplane config set autoshutdown_minutes <minutes>. With --reuse-gateways, Skyplane will start a background job on each gateway VM that triggers a VM shutdown after the specified delay. Note you will still pay for associated VM costs such as storage and network IPs even if VMs are shut down.
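For example, to keep reused gateways alive for an hour before they shut themselves down (illustrative value):

---> Extend the auto-shutdown window for reused gateways
$ skyplane config set autoshutdown_minutes 60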

To ensure that all gateways are stopped and no longer incur charges, run:

$ skyplane deprovision

Spot Instances to reduce instance costs#

Spot instances reduce the cost of provisioning VMs. These instances are charged at a lower price than on-demand instances but can be preempted at any time. If this occurs, the transfer will fail.

To use spot instances, run:

$ skyplane config set aws_use_spot_instances True
$ skyplane config set azure_use_spot_instances True
$ skyplane config set gcp_use_spot_instances True

Configuring networking between gateways#

Skyplane supports end-to-end encryption of data in flight: data is encrypted in the source region and decrypted only when it reaches the destination region, protecting it in transit. End-to-end encryption is enabled by default. To disable it, run:

$ skyplane config set encrypt_e2e false

Skyplane automatically compresses data at the source region to reduce egress fees from data transfer. We use the LZ4 compression algorithm by default as it can compress data at line rate. To disable compression, run:

$ skyplane config set compress false

Skyplane continually computes checksums at the source region to verify data integrity. We use the MD5 checksum algorithm by default. To disable checksum verification, run:

$ skyplane config set verify_checksums false

Optionally, and in addition to end-to-end encryption, Skyplane supports sending data over a TLS encrypted socket. We don’t generally recommend this unless you have a specific security requirement, as it can increase the performance variability of transfers. To enable TLS socket encryption, run:

$ skyplane config set encrypt_socket_tls true

Architecture#

Skyplane performs high-performance, cost-efficient, bulk data transfers by parallelizing transfers, provisioning resources for transfers, and identifying optimal transfer paths. Skyplane profiles cloud network cost and throughput across regions, and borrows ideas from RON to identify optimal transfer paths across regions and cloud providers.

To learn about how Skyplane works, please see our talk here:

On prem support#

Skyplane now supports local to cloud object store transfers. For this, Skyplane defaults to awscli (for AWS), and gsutil (for GCP). Let us know if you would like to see on-prem support for Azure.

Transfer Integrity and Checksumming#

Skyplane takes several steps to ensure the correctness of transfers. To ensure that data is transferred without corruption (e.g. bit flips or missing byte ranges), Skyplane computes checksums for data at the source region and verifies that the data matches the checksum before writing it to the destination region. To ensure that no files are dropped during the transfer, Skyplane queries the destination object store after a transfer and checks that all files were copied with the correct file size. To verify checksums for whole-file transfers, Skyplane computes MD5 hashes at the source region. Upon writing data at the destination, hashes are validated directly in the destination object store. For multipart transfers, hashes are validated at the destination VM before writing to the object store.

Security#

Data transfers in Skyplane are encrypted end-to-end. This means that all of the data chunks in each transfer are encrypted in the source region, transferred over the network (including through any relay regions) in encrypted form, and decrypted only when they reach destination region. Within the source and destination regions data may be handled in plaintext. For example, chunks are decrypted at the destination gateways and are inserted into the destination object store. For stronger security, the application using Skyplane may choose to store data in the source object store in encrypted form, so that it remains encrypted even in the source and destination regions. To afford higher efficiency for these use cases, Skyplane allows its own encryption to be disabled, to avoid encrypting data twice. The keys used for Skyplane’s end-to-end encryption are generated at the client and then communicated to the gateways over SSH.

HTTP/REST calls made between gateways are encrypted separately, using TLS.

Owing to the above encryption mechanisms, Skyplane guarantees confidentiality against a passive adversary who can view data transferred over the wide-area network and in relay regions. Such an adversary cannot see the contents of the data, but it can potentially see the following:

  • The quantity of data transferred.

  • The network path and overlay path taken by each chunk during the transfer.

  • The size of each chunk (which may be related to the size of the files/objects being transferred).

  • The timing of each chunk’s transfer between gateways and over the network.

Firewalls#

Skyplane adopts best practices to ensure data and gateway nodes are secure during transfers. In this section, we describe the design in brief. Firewalls are enabled by default, and we advise you not to turn them off. This not only keeps data secure in flight, but also prevents gateways from being compromised. Using a dedicated skyplane VPC with its own firewall rules guarantees that your default networks remain untouched, and the design also allows for multiple simultaneous transfers! If you have any questions regarding the design and/or implementation, we encourage you to open an issue with [Firewall] in the title.

GCP#

Skyplane creates a global VPC called skyplane when it is invoked for the first time in a new GCP project. Instances and firewall rules are applied on this VPC and do NOT interfere with the default GCP VPC. This ensures all the changes that Skyplane introduces are localized within the skyplane VPC: all instances and firewall rules apply only within the skyplane VPC. The skyplane global VPC consists of skyplane subnets for each region.

During every Skyplane transfer, a new set of firewall rules is created that allows the IPs of all instances involved in the transfer to exchange data with each other. These rules are set with priority 1000 and are revoked after the transfer completes. All instances can be accessed via ssh on port 22 and respond to ICMP packets to aid debugging.

AWS#

While GCP VPCs are global, AWS VPCs are regional: for every region involved in a transfer, Skyplane creates a skyplane VPC and a security group (SG). During transfers, firewall rules are instantiated that allow the IPs of all gateway instances involved in the transfer to relay data to each other. After the transfer, the firewall rules are deleted.

Azure#

Firewall support for Azure is on the roadmap.

Large Objects#

Skyplane breaks large objects into smaller sub-parts (currently AWS and GCP only) to improve transfer parallelism (also known as striping).

Performance Stats Collection#

Skyplane collects performance statistics. Some of this data is shared with the Skyplane developers to improve the accuracy of the Skyplane solver and for research purposes.

Here are the guiding principles of our collection policy:

  • No surprises — you will be notified before we begin collecting data. You will be notified of any changes to the data being collected or how it is used.

  • Easy opt-out: You will be able to easily opt-out of data collection.

  • Transparency — you will be able to review all data that is sent to us.

  • Control — you will have control over your data, and we will honor requests to delete your data.

  • We will not collect any personally identifiable data or proprietary code/data.

  • We will not sell data or buy data about you.

What data is collected#

We collect non-sensitive data that helps us understand how Skyplane is used. Personally identifiable data will never be collected. Please check /tmp/skyplane/usage/{client_id}/{session_id}/usage_stats.json json file to see the data we collect. Explicitly, we collect:

  1. System and OS information (OS version, kernel version, Python version).

  2. Anonymized client id and transfer session id.

  3. Source region and destination region per transfer.

  4. The collection of command arguments used in the transfer session.

  5. Total runtime and the aggregated transfer speed in Gbps.

  6. Error message if the transfer fails.

How to disable it#

There are two ways to disable performance stats collection:

  1. Run skyplane config set usage_stats false to disable collection for all future transfers. This won’t affect currently running transfers. Under the hood, this command writes {"usage_stats": false} to the global config file ~/.skyplane/config.json.

  2. Set the environment variable SKYPLANE_USAGE_STATS_ENABLED to 0, which temporarily disables performance stats collection.

Currently there is no way to enable or disable collection for a running transfer; you have to stop and restart the transfer.
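For example, to disable collection for a single run using the environment variable described above (the source and destination paths are placeholders):

---> Disable usage stats for one transfer only
$ SKYPLANE_USAGE_STATS_ENABLED=0 skyplane cp s3://... gs://...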

How does it work#

When you run a skyplane cp or skyplane sync command, Skyplane decides whether performance stats collection should be enabled by considering the following factors in order:

  1. It checks whether the environment variable SKYPLANE_USAGE_STATS_ENABLED is set: 1 means enabled and 0 means disabled.

  2. If the environment variable is not set, it reads the value of key usage_stats in the global config file ~/.skyplane/config.json: true means enabled and false means disabled. If there is no such key in global config file, then the performance stats collection is enabled by default.

Note: performance stats collection is enabled by default when you first run skyplane init.

Requesting removal of collected data#

To request removal of collected data, please email us at admin@skyplane.org with the client_id that you can find in /tmp/skyplane/usage/{client_id}/{session_id}/usage_stats.json.

Frequently Asked Questions (FAQ)#

Does the client_id and session_id map to personal data?

No, the uuid is a random ID that cannot be used to identify a specific person or machine. It does not live beyond the lifetime of your Skyplane transfer session and is primarily captured to enable us to honor deletion requests.

The client_id and session_id are logged so that deletion requests can be honored.

Could an enterprise easily configure an additional endpoint or substitute a different endpoint?

We definitely see this use case and would love to chat with you to make this work – email admin@skyplane.org.

Contact us#

If you have any feedback regarding performance stats collection, please email us at admin@skyplane.org.

On-Prem Transfers#

Currently, Skyplane supports on-prem transfers from local disk, NFS, and HDFS to cloud object stores.

HDFS Setup#

Skyplane uses PyArrow and libhdfs for the HDFS connection.

Transfers from HDFS require Hadoop and Java to be installed beforehand.

Note that the cluster needs to communicate with the Skyplane gateways. Please update the clusters’ inbound firewall rules to allow traffic from Skyplane.

Resolving HDFS Datanodes#

A file called hostname is located under the ./skyplane/scripts/on_prem folder. It is used to resolve each datanode’s hostname or internal IP to its external IP.

  • For each datanode, copy its hostname (or internal IP) and the corresponding external IP into the file.

  • After all the required information is added, each line of the hostname file should look like this:

<External IP>   <Datanodes' Hostname or Internal IP>

Testing the Transfer#

Now you can test running skyplane cp to transfer from local disk or an HDFS cluster to any cloud object store.

   ---> Copy from local disk
   $ skyplane cp -r /path/to/local/file gs://...

   ---> Copy from HDFS
   $ skyplane cp -r hdfs://... gs://...

FAQ#

I get InsufficientVCPUException when running a transfer#

Skyplane needs to provision VMs in each cloud to complete a transfer. The VMs read data from the source object store and send it along sockets to the destination. By default, Skyplane will provision 1 VM in each region where each VM has 32 vCPUs. If you do not have sufficient quota to provision these VMs, please see our guide to requesting cloud limit increases.

Alternatively, you can configure Skyplane to use fewer resources per transfer. You can reduce the default number of VMs used via max_instances and change the instance class for gateways by configuring aws_instance_class, azure_instance_class or gcp_instance_class with a smaller instance class.
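For example, to run with a single, smaller gateway per region (the instance class shown is just an illustration):

---> Reduce per-transfer resource usage
$ skyplane config set max_instances 1
$ skyplane config set aws_instance_class m5.xlarge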

Does Skyplane support local transfers?#

Skyplane has experimental support for transfers between local storage and cloud object stores (see the On-Prem Transfers section). Cloud-to-cloud transfers are fully supported.

How does Skyplane map object keys from source to destination?#

In the non-recursive case, Skyplane extracts the key from the full bucket path (s3://[bucket_name]/[key]) and places the object specified at dest_prefix/[key] if the dest_prefix provided is a directory (ends in /), or else replaces the object at dest_prefix with the extracted object.

In the recursive case, Skyplane appends a trailing slash to the source and dest paths if one does not already exist. After extracting the key from the source path, it is appended to the dest prefix to get object keys.
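For example, under the mapping described above (bucket names and keys are illustrative):

---> Non-recursive: the extracted key is placed under the destination prefix since it ends in /
$ skyplane cp s3://src-bucket/foo.txt s3://dst-bucket/backup/
# result: s3://dst-bucket/backup/foo.txt

---> Recursive: keys under the source prefix are appended to the destination prefix
$ skyplane cp -r s3://src-bucket/data/ s3://dst-bucket/archive/
# e.g. s3://src-bucket/data/a/b.txt is copied to s3://dst-bucket/archive/a/b.txt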

Troubleshooting MissingObjectException#

This exception is raised when:

  • no objects are found at the source prefix passed to the CLI

  • the source prefix is a directory and the --recursive flag is not set

  • there is a mismatch between the source prefix and the key for an object Skyplane is copying

To troubleshoot this exception, carefully check that the requested path is not empty and is accessible via the credential used by Skyplane. If this is the case, ensure that the --recursive flag is set if the source prefix is a directory.

As an example, to transfer the directory s3://some-bucket/some-directory/ to s3://some-bucket/destination/, you would run skyplane cp --recursive s3://some-bucket/some-directory/ s3://some-bucket/destination/.

Troubleshooting MissingBucketException#

This exception is raised when:

  • the source bucket does not exist

  • the destination bucket does not exist

  • the source bucket is not accessible via the credential used by Skyplane

  • the destination bucket is not accessible via the credential used by Skyplane

Using the cloud provider’s console, verify the bucket exists. If so, ensure that Skyplane has access to the bucket.

Note

Requester pays buckets: If it is a public bucket, it may be a requester pays bucket. Any egress fees from this bucket will be paid by the requester (i.e. Skyplane) instead of the bucket’s owner. By default Skyplane disables support for requester pays to avoid unexpected egress charges.

To enable support for requester pays buckets, run skyplane config set requester_pays true.

How can I switch between GCP projects?#

We recommend re-setting GCP credentials locally by running rm -r ~/.config/gcloud and then re-running gcloud auth application-default login. You can then set the project ID you want with gcloud config set project <PROJECT_ID>. Once you’ve updated authentication and the project, you can run skyplane init --reinit-gcp.
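Putting those steps together:

---> Reset GCP credentials and point Skyplane at a new project
$ rm -r ~/.config/gcloud
$ gcloud auth application-default login
$ gcloud config set project <PROJECT_ID>
$ skyplane init --reinit-gcp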

If you get an error saying Compute Engine API has not been used in project 507282715251 before or it is disabled, wait a few minutes for the API enablement to take effect and re-run skyplane init.

Faster Training Data Loading#

This tutorial explains how you can use the Skyplane API to quickly download data from an object store located in a different region or cloud than your training instance. See full workflow here.

Large-scale machine learning (ML) training typically includes a step for acquiring training data. The following example illustrates an ML workflow where the original ImageNet data is stored in an S3 bucket in the US-East-1 region.

[image: imagenet_training]

Remote vs. Local Regions#

Say that you have a VM for running training jobs in an AWS region, us-west-2. Reading data from a same-region S3 bucket will be very fast and free. However, if your data is in another region or cloud provider, reading the data will be much slower and will also incur per-GB egress fees. In this tutorial, we assume that our data is in a bucket in us-east-1 (the remote region), but we are running training from another region, us-west-2 (the local region).

Reading data from S3#

Directly reading data from S3 can be convenient to avoid having to download your entire dataset before starting to train. In this tutorial, we create an ImageNetS3 dataset that extends the S3IterableDataset object from AWS’s awsio package.

import io

import skyplane
import torch
import torchvision.transforms as transforms
from PIL import Image
from torch.utils.data import IterableDataset, DataLoader
from awsio.python.lib.io.s3.s3dataset import S3IterableDataset

class ImageNetS3(IterableDataset):
    def __init__(self, url_list, shuffle_urls=True):
        self.s3_iter_dataset = S3IterableDataset(url_list, shuffle_urls)
        self.transform = transforms.Compose(
            [
                transforms.RandomResizedCrop(224),
                transforms.RandomHorizontalFlip(),
                transforms.ToTensor(),
                transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225)),
            ]
        )

    def data_generator(self):
        try:
            while True:
                # Based on the alphabetical order of files, the sequence of label and image may change.
                # e.g. for files 0186304.cls 0186304.jpg, 0186304.cls will be fetched first
                _, label_fobj = next(self.s3_iter_dataset_iterator)
                _, image_fobj = next(self.s3_iter_dataset_iterator)
                label = int(label_fobj)
                image_np = Image.open(io.BytesIO(image_fobj)).convert("RGB")

                # Apply torchvision transforms if provided
                if self.transform is not None:
                    image_np = self.transform(image_np)
                yield image_np, label

        except StopIteration:
            return

    def __iter__(self):
        # Create a fresh iterator over the underlying S3 dataset each time the dataset is iterated
        self.s3_iter_dataset_iterator = iter(self.s3_iter_dataset)
        return self.data_generator()

We can create a data loader with the data located in our remote bucket:

    remote_bucket_url = "s3://us-east-1-bucket" 
    data_urls = [
        (remote_bucket_url + "/" if not remote_bucket_url.endswith("/") else remote_bucket_url) + f"imagenet-train-{i:06d}.tar"
        for i in range(100)
    ]
    dataset = ImageNetS3(data_urls)
    train_loader = DataLoader(dataset, batch_size=256, num_workers=2)

However, the latency of this dataloader will be very high and likely degrade training performance.

Transferring Data with Skyplane#

We can improve our data loader’s performance by transferring data to a local region first. We can do this by running:

    local_bucket_url = "s3://us-west-2-bucket" 

    # Step 1:  Create a Skyplane API client. It will read your AWS credentials from the AWS CLI by default
    client = skyplane.SkyplaneClient(aws_config=skyplane.AWSConfig())

    # Step 2:  Copy the data from the remote bucket to the local bucket.
    client.copy(src=remote_bucket_url, dst=local_bucket_url, recursive=True)

Once the copy completes, the following code will be able to read the training data from the bucket with low latency, and no egress cost:

    data_urls = [
        (local_bucket_url + "/" if not local_bucket_url.endswith("/") else local_bucket_url) + f"imagenet-train-{i:06d}.tar"
        for i in range(100)
    ]
    dataset = ImageNetS3(data_urls)
    train_loader = DataLoader(dataset, batch_size=256, num_workers=2)

Creating an Airflow Operator#

Skyplane can be easily incorporated into an Airflow DAG using a SkyplaneOperator, which can be utilized in data transfer tasks, such as replacing the S3toGCSOperator. The following example demonstrates a data analytics workflow where data is transferred from S3 to GCS to build a BigQuery dataset and then used in a PySpark data analysis job.

[image: airflow]

In this tutorial, we extend Airflow’s BaseOperator object to create a custom Skyplane operator, called SkyplaneOperator. We first define the fields of the SkyplaneOperator:

from typing import Optional

import skyplane
from airflow.models import BaseOperator  # type: ignore

class SkyplaneOperator(BaseOperator):
    template_fields = (
        "src_provider",
        "src_bucket",
        "src_region",
        "dst_provider",
        "dst_bucket",
        "dst_region",
        "config_path",
    )

    def __init__(
        self,
        *,
        src_provider: str,
        src_bucket: str,
        src_region: str,
        dst_provider: str,
        dst_bucket: str,
        dst_region: str,
        aws_config: Optional[skyplane.AWSConfig] = None,
        gcp_config: Optional[skyplane.GCPConfig] = None,
        azure_config: Optional[skyplane.AzureConfig] = None,
        **kwargs,
    ) -> None:
        super().__init__(**kwargs)
        self.src_provider = src_provider
        self.src_bucket = src_bucket
        self.src_region = src_region
        self.dst_provider = dst_provider
        self.dst_bucket = dst_bucket
        self.dst_region = dst_region
        self.aws_config = aws_config
        self.gcp_config = gcp_config
        self.azure_config = azure_config

    def execute(self, context):
        pass

Inside the execute function, we can instantiate a Skyplane client to create a dataplane and execute transfers:


def execute(self, context):
    client = skyplane.SkyplaneClient(aws_config=self.aws_config, gcp_config=self.gcp_config, azure_config=self.azure_config)
    dp = client.dataplane(self.src_provider, self.src_region, self.dst_provider, self.dst_region, n_vms=1)
    with dp.auto_deprovision():
        dp.provision()
        dp.queue_copy(self.src_bucket, self.dst_bucket, recursive=True)
        tracker = dp.run_async()

We can also add reporting on the transfer:

    with dp.auto_deprovision():
        ...
        print("Waiting for transfer to complete...")
        while True:
            bytes_remaining = tracker.query_bytes_remaining()
            if bytes_remaining is None:
                print("Transfer not yet started")
            elif bytes_remaining > 0:
                print(f"{(bytes_remaining / (2 ** 30)):.2f}GB left")
            else:
                break
            time.sleep(1)  # requires `import time`
        tracker.join()
        print("Transfer complete!")

Building from source#

Prerequisite: install development dependencies#

For Skyplane development, we use a variety of tools to check for code quality and to build documentation. To install these tools, run the following command:

pip install -r requirements-dev.txt

Setting up a developer environment#

Skyplane is composed of the client (runs locally on a user’s laptop) and gateway VMs (runs in respective clouds). Normally, the gateways use a pre-built nightly Docker image containing the latest build of the Skyplane gateway code (public.ecr.aws/s6m1p0n8/skyplane:edge). However, if you modify the gateway source (under skyplane/gateway), you will need to rebuild the gateway Docker image and push it to a container registry.

Ensure you have authenticated your Github account with Docker:

  • Install docker

curl -fsSL https://get.docker.com -o get-docker.sh && sh get-docker.sh
echo <PERSONAL_ACCESS_TOKEN> | sudo docker login ghcr.io -u <GITHUB_USERNAME> --password-stdin

Building and testing Skyplane#

Building and pushing a Skyplane Docker image#

To package the code into a Docker image and push it to a container registry on your account, run the following command (substitute YOUR_GITHUB_USERNAME_HERE for your Github username):

export SKYPLANE_DOCKER_IMAGE=$(bash scripts/pack_docker.sh <YOUR_GITHUB_USERNAME_HERE>)

This will build the Skyplane Docker image for the gateway and push it to ghcr.io under your user account. When running a Skyplane transfer, any provisioned gateways will pull the image named by the SKYPLANE_DOCKER_IMAGE environment variable to ensure a reproducible environment.

First time setup: make sure ghcr image is “public”#

By default, new packages on ghcr are private. To make the package public so gateways can download the image, convert the package to public (you only need to do this once):

[image: Package settings]

  • Click the “Make public” button.

[image: Make public]

Building the Skyplane client#

We use Poetry to manage package dependencies during development. For convenience, we provide a Poetry wrapper via setup.py. To build the client, install the Skyplane package in development mode. The package points to your current checked-out version of the code, and any edits to the Skyplane client will immediately apply to the skyplane CLI command.

pip install -e ".[aws,azure,gcp]"

Testing a transfer#

We can run the skyplane CLI to test a transfer. The CLI will read your SKYPLANE_DOCKER_IMAGE environment variable and use that Docker image when launching gateways.

skyplane init
skyplane cp s3://... s3://...

Development Tips#

If testing transfers repeatedly, we recommend using the --reuse-gateways flag to reduce setup time.

Skyplane has debugging tools to ssh into gateway instances, view detailed transfer logs, and query chunk states during transfers. See Debugging Tools for more.

Contributing to Skyplane#

Welcome to Skyplane! Everyone is welcome to contribute to Skyplane. We are always looking for new features and improvements and we value everyone’s input.

Ways to Contribute#

There are many ways to contribute to Skyplane:

  • Answering questions on Skyplane’s discussions page

  • Improving Skyplane’s documentation

  • Filing bug reports or reporting sharp edges via Github issues

  • Contributing to our codebase.

Code Contributions#

We welcome pull requests, in particular for those issues marked with good first issue.

For other proposals or larger features, we ask that you open a new GitHub Issue or Discussion. We especially encourage external contributions to support additional cloud providers and object store endpoints.

To see more on how to setup a development environment, see the Development Guide.

Submitting pull requests#

Before you submit a pull request, make sure to complete the following steps:

  1. Fork the Skyplane repository to create a copy of the project in your own account.

  2. Set up a developer environment as described in the Development Guide.

  3. Create a development branch (git checkout -b feature_name)

  4. Test your changes manually using skyplane cp and with the unit test suite:

    $ pytest tests/unit_nocloud/
    
  5. Ensure your code is autoformatted and passes type checks:

    $ pip install -r requirements-dev.txt
    $ black -l 140 .
    $ pytype --config .pytype.cfg skyplane
    $ autoflake --in-place --remove-all-unused-imports --remove-unused-variables --recursive skyplane
    
  6. If you updated documentation, test the docs:

    $ pip install sphinx-autobuild
    $ cd docs
    $ pip install -r requirements.txt
    $ sphinx-autobuild -b html . /tmp/docs_build
    
  7. Commit your changes using a descriptive commit message.

  8. Create a pull request on the main Skyplane repo from your fork. Consult Github Help for more details.

Debugging Tools#

Skyplane has built-in tools for debugging during development.

Logs#

Client Logs#

When you run a transfer, the client logs will be written to a folder inside of the /tmp/skyplane/transfer_logs/ directory.

Gateway Logs#

Inside the client.log file, for each provisioned Skyplane gateway there will be the lines:

[INFO]     Log viewer: http://127.0.0.1:[PORT]/container/[CONTAINER_ID] 
[INFO]     API: http://127.0.0.1:[PORT]

which correspond to the gateway log viewer and the gateway chunk API. You can view the logs for each gateway by going to the “Log viewer” address for that gateway.

Chunk API#

The gateway chunk API allows for status of chunks on each gateway to be queried externally, and is used by the Skyplane client to monitor transfers.

    * GET /api/v1/status - returns status of API
    * GET /api/v1/servers - returns list of running servers
    * GET /api/v1/chunk_requests - returns list of chunk requests (use {'state': '<state>'} to filter)
    * GET /api/v1/chunk_requests/<int:chunk_id> - returns chunk request
    * POST /api/v1/chunk_requests - adds a new chunk request
    * PUT /api/v1/chunk_requests/<int:chunk_id> - updates chunk request
    * GET /api/v1/chunk_status_log - returns list of chunk status log entries
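For example, you can query these endpoints with curl from your laptop, substituting the [PORT] value printed in client.log (the paths come from the list above):

---> Check gateway API status and list chunk requests
$ curl http://127.0.0.1:[PORT]/api/v1/status
$ curl http://127.0.0.1:[PORT]/api/v1/chunk_requests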

Gateway ssh#

You can ssh into a gateway with:

skyplane ssh

which will list available gateways that you can select to ssh into.

Once you’ve ssh-ed into a gateway instance, you can interact with the Skyplane Docker container running on that gateway.

API documentation#

Welcome to Skyplane API!#

Skyplane’s ease of use and efficiency make it a popular choice among developers for handling data transfer across the cloud. With its API, developers can easily integrate Skyplane into their project pipelines (e.g. the ML training example below) and use the same functionality as the command line interface (CLI), such as copying and syncing, along with additional features exclusive to the API. This opens up new opportunities for the growth and application of Skyplane in the future.

Note

This API is still experimental, and your feedback is much appreciated in improving it!

[image: overview.png]

Installing Skyplane#

We have provided a detailed tutorial here.

How to import Skyplane#

To access Skyplane and its functions, you can import it in your Python code like this:

import skyplane

Really easy!

How to launch a Skyplane transfer job#

To start, you need to create a Skyplane client that reads your cloud credentials. Every transfer job is managed by this client or by a dataplane configured by the client.

Skyplane then provides two ways to transfer: (1) a simple copy, which takes care of VM provisioning and deprovisioning so you can launch a transfer with one line; and (2) a dataplane session, which gives you more freedom to launch multiple transfer jobs together and asynchronously.

A dataplane calculates the optimal transfer topology between the source and destination regions and can be reused to launch jobs later.

Below are the two examples using the methods mentioned above.

Example of how to use API simple copy that automatically deprovisions the VMs#
import skyplane

client = skyplane.SkyplaneClient(aws_config=skyplane.AWSConfig())
print(f"Log dir: {client.log_dir}/client.log")
client.copy(src="s3://skycamp-demo-src/synset_labels.txt", dst="s3://skycamp-demo-us-east-2/imagenet-bucket/synset_labels.txt", recursive=False)
Example of how to use API dataplane session#
# from https://github.com/skyplane-project/skyplane/blob/main/examples/api_demo.py
import skyplane

client = skyplane.SkyplaneClient(aws_config=skyplane.AWSConfig())
print(f"Log dir: {client.log_dir}/client.log")
dp = client.dataplane("aws", "us-east-1", "aws", "us-east-2", n_vms=1)
with dp.auto_deprovision():
    dp.provision()
    dp.queue_copy(
        "s3://skycamp-demo-src/synset_labels.txt", "s3://skycamp-demo-us-east-2/imagenet-bucket/synset_labels.txt", recursive=False
    )
    tracker = dp.run_async()
    # You can monitor the transfer by querying tracker.query_bytes_remaining()

Integrations with other applications#

ImageNet ML training example#

See full workflow here.

Large-scale machine learning (ML) training typically includes a step for acquiring training data. The following example illustrates an ML workflow where the original ImageNet data is stored in an S3 bucket in the US-East-1 region.

In many cases, datasets and virtual machines (VMs) are located in different regions. This can lead to slow data transfer speeds and high costs for data egress fees when using cloud provider tools, such as aws s3 cp, to download data to the VM’s local disk. Skyplane offers a solution by allowing a fast and more cost-effective transfer of the dataset to an S3 bucket in the same region as the VM (e.g. US-West-2), with direct streaming of the data to the model without the need for downloading it to the local folder.

[image: imagenet.png]

This process is as simple as adding just two lines of code, similar to the demonstration of the Skyplane simple copy.

# Step 1:  Create a Skyplane API client. It will read your AWS credentials from the AWS CLI by default
client = skyplane.SkyplaneClient(aws_config=skyplane.AWSConfig())

# Step 2:  Copy the data from the remote bucket to the local bucket.
client.copy(src=args.remote_s3_path, dst=args.local_s3_path, recursive=True)
Airflow SkyplaneOperator#

Skyplane can be easily incorporated into an Airflow DAG using a SkyplaneOperator, which can be utilized in data transfer tasks, such as replacing the S3toGCSOperator. The following example demonstrates a data analytics workflow where data is transferred from S3 to GCS to build a BigQuery dataset and then used in a PySpark data analysis job.

[image: airflow.png]

Contents#

CLI documentation#

skyplane cp documentation

skyplane#

skyplane [OPTIONS] COMMAND [ARGS]...

Options

--install-completion <install_completion>#

Install completion for the specified shell.

Options:

bash | zsh | fish | powershell | pwsh

--show-completion <show_completion>#

Show completion for the specified shell, to copy it or customize the installation.

Options:

bash | zsh | fish | powershell | pwsh

init#

Initialize the Skyplane CLI with your cloud credentials

skyplane init [OPTIONS]

Options

-y, --non-interactive#

Run non-interactively

Default:

False

--reinit-azure, --no-reinit-azure#
Default:

False

--reinit-gcp, --no-reinit-gcp#
Default:

False

--reinit-ibm, --no-reinit-ibm#
Default:

False

--reinit-cloudflare, --no-reinit-cloudflare#
Default:

False

--disable-config-aws, --no-disable-config-aws#
Default:

False

--disable-config-azure, --no-disable-config-azure#
Default:

False

--disable-config-gcp, --no-disable-config-gcp#
Default:

False

--disable-config-ibm, --no-disable-config-ibm#
Default:

True

--disable-config-cloudflare, --no-disable-config-cloudflare#
Default:

False

cp#

Copy files between any two cloud object stores

skyplane cp [OPTIONS] SRC DST

Options

-r, --recursive#

If true, will copy objects at folder prefix recursively

Default:

False

--debug, --no-debug#

If true, will write debug information to debug directory.

Default:

False

--multipart, --no-multipart#

If true, will use multipart uploads.

Default:

True

-y, -f, --confirm#

Confirm all transfer prompts

Default:

False

-n, --max-instances <max_instances>#

Number of gateways

Default:

1

--max-connections <max_connections>#

Number of connections per gateway

Default:

32

--solver <solver>#

Solver to use for transfer

Default:

direct

-t, --tput <solver_required_throughput_gbits>#

Required throughput to be solved for in Gbps

Default:

1

Arguments

SRC#

Required argument

DST#

Required argument

sync#

Sync files between any two cloud object stores

skyplane sync [OPTIONS] SRC DST

Options

--debug, --no-debug#

If true, will write debug information to debug directory.

Default:

False

--multipart, --no-multipart#

If true, will use multipart uploads.

Default:

True

-y, -f, --confirm#

Confirm all transfer prompts

Default:

False

-n, --max-instances <max_instances>#

Number of gateways

Default:

1

--max-connections <max_connections>#

Number of connections per gateway

Default:

32

--solver <solver>#

Solver to use for transfer

Default:

direct

-t, --tput <solver_required_throughput_gbits>#

Required throughput to be solved for

Default:

1

Arguments

SRC#

Required argument

DST#

Required argument

ssh#

SSH into a running gateway.

skyplane ssh [OPTIONS]
deprovision#

Deprovision all resources created by skyplane.

skyplane deprovision [OPTIONS]

Options

-a, --all#

Deprovision all resources including networks.

Default:

False

--filter-client-id <filter_client_id>#

Only deprovision instances with this client ID under the instance tag.

Benchmarks#

This page compares the performance of Skyplane and cloud data transfer tools such as AWS DataSync and Google Cloud Data Transfer. Achieved results vary depending on the region tested, the size of the file, and the number of VMs used.

Benchmark: Large file benchmark#

In this benchmark, we measure the impact of striping on large file transfers. We transfer a single large file between two AWS regions. Overall, Skyplane is 113.4x faster than AWS DataSync for the largest tested file transfer.

[image: Large file benchmark]

Benchmark setup:

  • Source: AWS ap-southeast-2 (Sydney)

  • Destination: AWS eu-west-3 (Paris)

  • Number of files: 1x

  • File size: {4, 8, 16, 32, 64}GB

  • Number of Skyplane VMs: 8

Benchmark: Cost comparison w/ compression#

In this benchmark, we measure the impact of Skyplane’s compression algorithm, which substantially reduces data egress fees. In addition, Skyplane does not charge any service fee. Overall, Skyplane is 6.2x cheaper than AWS DataSync when transferring an uncompressed 220 GB dump of English Wikipedia.

[image: Cost comparison w/ compression]

Benchmark setup:

  • Source: AWS us-east-1 (North Virginia)

  • Destination: AWS us-west-2 (Oregon)

  • Number of files: ~14,000x (sharded to 16 MB chunks)

  • File size: 223.67 GB

  • Number of Skyplane VMs: 8

Benchmark: ImageNet transfer on AWS#

In this benchmark, we measure the transfer speed and universal support of the Skyplane API. For transferring a 70 GB fake ImageNet dataset, Skyplane supports transfers across AWS, GCP, and Azure, completing the transfer in around 25 seconds for the selected region pairs below. AWS DataSync, by contrast, only supports transfers into and out of AWS services, and it is much slower.

[image: DataSync_data_transfer]

Benchmark setup:

  • Source: labeled on axis

  • Destination: labeled on axis

  • Number of files: 1,152x (training and validation batches)

  • File size: 70 GB

  • Number of Skyplane VMs: 8

Benchmark: ImageNet transfer on GCP#

In this benchmark, we measure the transfer speed and universal support of the Skyplane API. For transferring a 70 GB fake ImageNet dataset, Skyplane supports transfers across AWS, GCP, and Azure, completing the transfer in around 60 seconds for the selected region pairs below. GCP Data Transfer, by contrast, only supports transfers into GCP services, and it is much slower.

[image: gcp_data_transfer]

Benchmark setup:

  • Source: labeled on axis

  • Destination: labeled on axis

  • Number of files: 1,152x (training and validation batches)

  • File size: 70 GB

  • Number of Skyplane VMs: 8

Benchmark: Impact of Chunk Sizes on Transfer time and Throughput#

In this benchmark, we measure the transfer time and throughput of a transfer of 100 files of roughly 140 MB each. The transfer was done from an S3 bucket in us-west-1 to another S3 bucket in us-east-1.

Chunk Size   Time      Throughput
5 MiB        81.91 s   1.36 Gbps
8 MiB        34.64 s   3.23 Gbps
16 MiB       19.47 s   5.74 Gbps
32 MiB       19.89 s   5.62 Gbps
64 MiB       16.17 s   6.91 Gbps
128 MiB      18.20 s   6.14 Gbps

Instructions to replicate benchmarks#

To replicate select benchmarks, see the following guide:

Benchmark instructions#
Skyplane vs AWS DataSync small file#
Setup#

For this experiment, we use the fake_imagenet dataset (from Google, linked here) as the test data and pick Sydney and Paris as the two transfer endpoints. fake_imagenet is about 70 GB and contains batches of training, validation, and test data.

Skyplane#

To time the entire Skyplane transfer, we use the Linux time command. We add the -y argument, which enables auto-confirm so no time is spent waiting for user input. We use 8 VMs per source/destination region. Below is an example of the command we run:

$ time skyplane cp -r -y s3://{bucket name in ap-southeast-2 region}/fake_imagenet/ s3://{bucket name in eu-west-3 region}/fake_imagenet/ -n 8

In the output, you will find

✅ Transfer completed successfully
Transfer runtime: 28.11s, Throughput: 19.85Gbps

This is the result we include in the plot where we compare only the transfer time (provisioning time, etc. are excluded).

DataSync#

We use AWS DataSync for comparison. Instructions on how to set it up can be found here. Then, we can launch a task for the transfer. Below is a snapshot of such a task:

[image: DataSync_fake_imagenet]

Skyplane vs GCP Data Transfer small file#
Setup#

For this experiment, we use the fake_imagenet dataset (from Google, linked here) as the test data and pick different regions to transfer between. fake_imagenet is about 70 GB and contains batches of training, validation, and test data.

Skyplane#

To time the entire Skyplane transfer, we use the Linux time command. We add the -y argument, which enables auto-confirm so no time is spent waiting for user input. We use 8 VMs per source/destination region. Below is an example of the command we run:

$ time skyplane cp -r -y s3://{bucket name in source region}/fake_imagenet/ gs://{bucket name in destination region}/fake_imagenet/ -n 8

In the output, you will find

✅ Transfer completed successfully
Transfer runtime: 28.73s, Throughput: 19.42Gbps

This is the result we include in the plot where we compare only the transfer time (provisioning time, etc. are excluded).

Data Transfer#

We use Google Cloud Data Transfer for comparison. Instructions on how to set it up can be found here. Then, we can launch a job for the transfer. Below is a snapshot of such a job:

[image: GCP_fake_imagenet]

Skyplane vs AWS DataSync for large files#
Setup#

For this experiment, we use randomly generated large files of fixed sizes as the test data and pick Sydney and Paris as the two transfer endpoints. The random files have sizes of 4 GB, 8 GB, 16 GB, 32 GB, and 64 GB, each generated by the Linux command head -c {size}G </dev/urandom >myfile.
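For example, generating the 64 GB test file used in the command below:

---> Generate a 64 GB file of random data
$ head -c 64G </dev/urandom >random_file_64GB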

Skyplane#

To time the entire Skyplane transfer, we use the Linux time command. We add the -y argument, which enables auto-confirm so no time is spent waiting for user input. We use 8 VMs per source/destination region. Below is an example of the command we run:

$ time skyplane cp -r -y s3://{bucket name in ap-southeast-2 region}/random_file_64GB s3://{bucket name in eu-west-3 region}/random_file_64GB -n 8

In the output, you will find

✅ Transfer completed successfully
Transfer runtime: 76.59s, Throughput: 6.69Gbps

This is the result we include in the plot where we compare only the transfer time (provisioning time, etc. are excluded).

DataSync#

We use AWS DataSync for comparison. Instructions on how to set it up can be found here. Then, we can launch a task for the transfer. Below is a snapshot of such a task:

[image: DataSync_random_file_64GB]

Skyplane vs AWS DataSync vs Rsync for transfer cost#
Setup#

For this experiment, we use an uncompressed 223 GB dump of English Wikipedia (from here) as the test data and pick North Virginia and Oregon as the two transfer endpoints. The Wikipedia dump is first uncompressed to a tar file and then sharded into 16 MB chunks with the Linux command split -b.
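For illustration, one way to do the sharding with split (the tar file name and output directory are placeholders):

---> Shard the uncompressed tar into 16 MB chunks
$ mkdir wiki_dumps
$ split -b 16M enwiki.tar wiki_dumps/chunk_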

Skyplane#

To time the entire Skyplane transfer, we use the Linux time command. We add the -y argument, which enables auto-confirm so no time is spent waiting for user input. We use 1 VM per source/destination region. Below is an example of the command we run:

$ time skyplane cp -r -y s3://{bucket name in us-east-1 region}/wiki_dumps s3://{bucket name in us-west-2 region}/wiki_dumps -n 1

In the output, you will find

✅ Transfer completed successfully
Transfer runtime: 266.23s, Throughput: 6.26Gbps

This is the transfer time (provisioning time, etc. excluded) during which we pay for the AWS EC2 VMs.

You will also find

Compression saved 78.94% of egress fees

This indicates the percentage of the data transfer (egress) fee saved by Skyplane’s own compression algorithm.

To calculate the cost, we need two numbers: (1) the cost of transferring data out from N. Virginia to Oregon, which is $0.02 per GB; and (2) the cost of using on-demand VMs (of type m5.8xlarge) in N. Virginia and Oregon, which is $1.536 per hour. The values can be found from AWS here.

The total cost for Skyplane transfer is

223.67 GB * (1 - 0.7894) * $0.02 / GB + 2 * $(1.536 / 3600) / s * 266.23 s = $1.17
DataSync#

We use AWS DataSync for comparison. Instructions on how to set it up can be found here. Then, we can launch a task for the transfer. Below is a snapshot of such a task:

[image: DataSync_wiki_dumps]

To calculate the cost, we need one extra number: the cost for using AWS DataSync service to copy data, which is $0.0125 per GB. It can be found here.

The total cost for AWS DataSync transfer is

223.67 GB * ($0.02 + $0.0125) / GB = $7.27

which is independent of the transfer speed.

Rsync#

We also use rsync for comparison. First, we mount the AWS S3 buckets on AWS EC2 VMs using the s3fs tool. An example command would be s3fs mybucket /path/to/mountpoint. Then, we can treat folders on the EC2 VMs as folders in the S3 buckets and use rsync to transfer. To do that, we use one VM in Oregon as the local client and the other VM in N. Virginia as the remote client.

In addition, rsync can compress the file data as it is sent to the destination. We do so with the rsync -z argument, which uses zlib to compress the data; at the default compression level, it saves 86% of the egress fee. The percentage is estimated with the following command:

pigz -6 -z -r -v wiki_dumps/

To time the entire rsync transfer, we use the Linux time command. Below is an example of the command we run:

time rsync -az -e "ssh -i .ssh/{key to access Virginia VM}.pem" ubuntu@Virginia_VM-IP-Address:~/us-east-1/ us-west-2/

In the output, you will find

real    396m46.729s
user    0m23.663s
sys     0m19.536s

The real time is how long we have used our AWS EC2 VMs.

The total cost for rsync transfer is

223.67 GB * (1 - 0.8608) * $0.02 / GB + 2 * $(1.536 / 3600) / s * 23807 s = $20.94

Roadmap#

Skyplane is an actively developed project. It will have đŸ”Ș SHARP EDGES đŸ”Ș. Please file an issue if you encounter bugs.

If you are interested in contributing, please see our contributor guide.

Skyplane 0.2.0 release (current)#

  • Firewalled VM-to-VM VPCs for Azure.

  • Large object transfer support for Azure.

  • Integration testing suite in CI.

  • Improvements to the scalability of transfers with many files (>10000).

Upcoming#

  • On-prem transfer support (design needed).

  • Broadcast support to multiple destinations.

  • Skyplane managed buckets for cross-cloud object storage.