Welcome to Skyplane!#
Note
This project is under active development.
🔥 Blazing fast bulk data transfers between any cloud 🔥
pip install skyplane[aws]
skyplane init
skyplane [sync/cp] [local/s3/gs/azure]://mybucket/big_dataset [local/s3/gs/azure]://mybucket2/
Skyplane is a tool for blazingly fast bulk data transfers between object stores in the cloud. It provisions a fleet of VMs in the cloud to transfer data in parallel while using compression and bandwidth tiering to reduce cost.
Skyplane is:
🔥 Blazing fast (110x faster than AWS DataSync)
🤑 Cheap (4x cheaper than rsync)
🌐 Universal (AWS, Azure and GCP)
You can use Skyplane to transfer data:
between object stores within a cloud provider (e.g. AWS us-east-1 to AWS us-west-2)
between object stores across multiple cloud providers (e.g. AWS us-east-1 to GCP us-central1)
between local storage and cloud object stores (experimental)
Skyplane currently supports the following source and destination endpoints (any source and destination can be combined):

| Endpoint | Source | Destination |
|---|---|---|
| AWS S3 | ✅ | ✅ |
| Google Storage | ✅ | ✅ |
| Azure Blob Storage | ✅ | ✅ |
| Local Disk | ✅ | (in progress) |
Skyplane is an actively developed project. It will have SHARP EDGES. Please file an issue or ask the contributors via the #help channel on our Slack if you encounter bugs.
Contents#
Installation#
We're ready to install Skyplane. It's as easy as:
---> Install skyplane from PyPI:
$ pip install "skyplane[aws]"
# install support for other clouds as needed:
# $ pip install "skyplane[azure]"
# $ pip install "skyplane[gcp]"
# $ pip install "skyplane[all]"
Note
If you are using an M1 MacBook with the arm64 architecture, you will need to install Skyplane as follows:
$ GRPC_PYTHON_BUILD_SYSTEM_OPENSSL=1 GRPC_PYTHON_BUILD_SYSTEM_ZLIB=1 pip install "skyplane[all]"
Setting up Cloud Credentials#
Skyplane needs access to cloud credentials to perform transfers. To get started with setting up credentials, make sure you have cloud provider CLI tools installed:
---> For AWS:
$ pip install awscli
---> For Google Cloud:
$ # install the gcloud CLI by following https://cloud.google.com/sdk/docs/install
---> For Azure:
$ pip install azure-cli
Once you have the CLI tools set up, log into each cloud provider's CLI:
---> For AWS:
$ aws configure
---> For Google Cloud:
$ gcloud auth application-default login
---> For Azure:
$ az login
Now, you can initialize Skyplane with your desired cloud providers. Skyplane autodetects cloud credentials and valid regions from your CLI environment.
---> Setup cloud provider connectors:
$ skyplane init
Transferring Data via Skyplane CLI#
We're ready to use the Skyplane CLI! Let's use skyplane cp to copy files from AWS to GCP:
---> Ready to rock and roll! Copy some files:
$ skyplane cp -r s3://... gs://...
To transfer only new objects, you can instead use skyplane sync:
---> Copy only diff
$ skyplane sync s3://... gs://...
To transfer from a local disk or an HDFS cluster, you can also use skyplane cp:
(Note: on-prem transfers require additional setup. Please see the On-Prem Transfers section for more details.)
---> Copy from local disk
$ skyplane cp -r /path/to/local/file gs://...
---> Copy from HDFS
$ skyplane cp -r hdfs://... gs://...
Quickstart#
CLI#
The simplest way to run transfers on Skyplane is to use the CLI. To transfer files from AWS to GCP, you can run:
skyplane cp -r s3://... gs://...
You can also sync directories to avoid copying data that is already in the destination location:
skyplane sync s3://... gs://...
Python API#
You can also use Skyplane through its Python API client. To copy a single object or folder, you can run:
import skyplane
client = skyplane.SkyplaneClient()
client.copy(src="s3://bucket-src/key", dst="s3://bucket-dst/key", recursive=False)
This will create a Skyplane dataplane (i.e. a cluster of gateway VMs), execute the transfer, and tear down the cluster upon completion.
You can also execute multiple transfers on the same dataplane to reduce overhead from VM startup time. To do this, you can define a dataplane object and provision it:
dp = client.dataplane("aws", "us-east-1", "aws", "us-east-2", n_vms=8)
dp.provision()
This will create a dataplane for transfers between us-east-1 and us-east-2 with 8 VMs per region. Now, we can queue transfer jobs in this dataplane:
# queue transfer
dp.queue_copy("s3://bucket1/key1", "s3://bucket2/key1")
dp.queue_copy("s3://bucket1/key2", "s3://bucket2/key2")
# execute transfer
tracker = dp.run_async()
# monitor transfer status
remaining_bytes = tracker.query_bytes_remaining()
The queued transfers won't run until you call dp.run() or dp.run_async(). Once you run the transfer, you can monitor it with the returned tracker object. Once the transfer is completed, make sure to deprovision the dataplane to avoid cloud costs:
# tear down the dataplane
dp.deprovision()
You can also have Skyplane deprovision the dataplane automatically using dp.auto_deprovision():
with dp.auto_deprovision():
dp.provision()
dp.queue_copy(...)
tracker = dp.run_async()
Now you can programmatically transfer terabytes of data across clouds! To see some examples of applications you can build with the API, you can check out our tutorials on how to load training data from another region and build an Airflow operator.
Configuration#
Skyplane comes with a variety of knobs to tune to adjust performance or change how VMs are provisioned. You can configure these options using the skyplane config set <key> <value> command.
Increasing performance of transfers via parallelism#
Skyplane transfers files in parallel across a set of gateway VMs. The total bandwidth of a transfer is mostly determined by the number of VMs used. Skyplane will automatically partition and distribute the transfer across the set of gateway VMs.
To increase the parallelism of a transfer, run:
$ skyplane config set max_instances <number>
By default, Skyplane will use a maximum of 1 VM in each region. This limit is conservative, since provisioning too many VMs can result in failed transfers due to vCPU limits and other resource constraints.
If you run into vCPU limits, increase them by following this guide:
Increasing cloud vCPU limits#
Skyplane utilizes parallel VMs to transfer data at high rates. However, if you do not have sufficient vCPU quota to support the number of VMs you need, you can request a quota increase from each respective cloud.
Increasing AWS quotas#
To increase your AWS quota, follow these steps:
Go to the AWS EC2 console
Select the region you want to increase your quota for from the top right corner of the page.
Select the Quotas tab.
Search for "Running On-Demand Standard (A, C, D, H, I, M, R, T, Z) instances" and select the radio button on the left.
Click "Request quota increase".
In the form, request a higher vCPU allocation.
By default, each Skyplane VM uses 32 vCPUs and provides up to 5Gbps of network throughput on AWS.
Example: If you'd like to use 8 VMs per region, request 256 vCPUs.
Click "Request". AWS can take up to a few days to review the request.
Increasing Azure quotas#
To increase your Azure quota, follow these steps:
Go to the Azure Quota request page in the Azure portal.
Filter the quotas by your subscription:
Under Search, enter "Standard Dv4 Family vCPUs".
Under the service dropdown, select "Compute".
Under the subscription dropdown, select your Azure subscription.
Under region, select the regions you want to increase your quota for.
Checkmark all the quotas you want to increase.
Click "Request quota increase" and select "Enter a new limit".
Enter the number of vCPUs you want to increase your quota to.
By default, each Skyplane VM uses 32 vCPUs and provides up to 12.5Gbps on Azure.
Example: If you'd like to use 8 VMs per region, enter 256 vCPUs.
Click "Submit". Azure can take up to a few days to review the request.
Increasing GCP quotas#
To increase your GCP quotas, follow these steps:
Go to the GCP Console.
Select your project from the top dropdown.
Search for "All quotas" and navigate to the All quotas page.
Filter the quota list:
Under "Service", select "Compute Engine API".
Under "Quota", select "N2 CPUs".
Select all regions you want to increase your quota for.
Click "Edit quotas".
Enter the number of vCPUs you want to increase your quota to.
By default, each Skyplane VM uses 32 vCPUs and provides up to 7Gbps of network throughput on GCP.
Example: If you'd like to use 8 VMs per region, request 256 vCPUs.
Click "Submit request". GCP can take up to a few days to review the request. In some cases, limit increases are approved immediately via automated review.
Reusing Gateways#
It can take 45s to 2m to provision gateway VMs for a transfer. If you are repeatedly transferring data between the same pair of regions, you can reuse gateway VMs. By default, Skyplane terminates these VMs to avoid unnecessary VM costs.
When running a cp or sync command, pass the --reuse-gateways flag to Skyplane to reuse gateway VMs:
$ skyplane cp --reuse-gateways <source> <destination>
$ skyplane sync --reuse-gateways <source> <destination>
Skyplane will attempt to automatically deprovision these gateways after 15 minutes by default. Change this interval via skyplane config set autoshutdown_minutes <minutes>. With --reuse-gateways, Skyplane will start a background job on each gateway VM that triggers a VM shutdown after the specified delay. Note that you will still pay for associated VM costs such as storage and network IPs even while VMs are shut down.
To ensure that all gateways are stopped and no longer incur charges, run:
$ skyplane deprovision
Spot Instances to reduce instance costs#
Spot instances reduce the cost of provisioning VMs. These instances are charged at a lower price than on-demand instances but can be preempted at any time. If this occurs, the transfer will fail.
To use spot instances, run:
$ skyplane config set aws_use_spot_instances True
$ skyplane config set azure_use_spot_instances True
$ skyplane config set gcp_use_spot_instances True
Configuring networking between gateways#
Skyplane supports encrypting data end-to-end: chunks are encrypted at the source region and stay encrypted until they reach the destination region (see the Security section). We enable end-to-end encryption by default. To disable it, run:
$ skyplane config set encrypt_e2e false
Skyplane automatically compresses data at the source region to reduce egress fees from data transfer. We use the LZ4 compression algorithm by default as it can compress data at line rate. To disable compression, run:
$ skyplane config set compress false
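For illustration, this is what LZ4 frame compression of a chunk looks like with the lz4 Python package (an example of the algorithm, not Skyplane's internal code path):

import lz4.frame

chunk = b"example chunk of object data" * 1000
compressed = lz4.frame.compress(chunk)            # compressed at the source region before egress
assert lz4.frame.decompress(compressed) == chunk  # decompressed at the destination region
print(f"compression ratio: {len(chunk) / len(compressed):.1f}x")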
Skyplane continually computes checksums at the source region to verify data integrity. We use the MD5 checksum algorithm by default. To disable checksum verification, run:
$ skyplane config set verify_checksums false
Optionally, and in addition to end-to-end encryption, Skyplane supports sending data over a TLS-encrypted socket. We don't generally recommend this unless you have a specific security requirement, as it can increase the performance variability of transfers. To enable additional TLS socket encryption, run:
$ skyplane config set encrypt_socket_tls true
Architecture#
Skyplane performs high-performance, cost-efficient, bulk data transfers by parallelizing transfers, provisioning resources for transfers, and identifying optimal transfer paths. Skyplane profiles cloud network cost and throughput across regions, and borrows ideas from RON (Resilient Overlay Networks) to identify optimal transfer paths across regions and cloud providers.
To learn about how Skyplane works, please see our talk here:
On prem support#
Skyplane now supports local-to-cloud object store transfers. For these, Skyplane defaults to awscli (for AWS) and gsutil (for GCP). Let us know if you would like to see on-prem support for Azure.
Transfer Integrity and Checksumming#
Skyplane takes several steps to ensure the correctness of transfers. To ensure that data is transferred without corruption (e.g. bit flips or missing byte ranges), Skyplane computes checksums for data at the source region and verifies that the data matches the checksum before writing it to the destination region. To ensure that no files are dropped during the transfer, Skyplane queries the destination object store after a transfer and checks that all files were copied with the correct file size. To verify checksums for whole-file transfers, Skyplane computes MD5 hashes at the source region. Upon writing data at the destination, hashes are validated directly in the destination object store. For multipart transfers, hashes are validated at the destination VM before writing to the object store.
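As a simplified illustration of this kind of whole-file check (not Skyplane's internal implementation; the paths are hypothetical), an MD5 digest computed in streaming fashion at the source can be compared against a digest computed at the destination:

import hashlib

def md5_of_file(path: str, chunk_size: int = 8 * 1024 * 1024) -> str:
    """Compute an MD5 digest by streaming the file in fixed-size chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# hypothetical paths, for illustration only
source_md5 = md5_of_file("/data/source_object")
dest_md5 = md5_of_file("/data/destination_object")
assert source_md5 == dest_md5, "checksum mismatch: object corrupted in transit"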
Security#
Data transfers in Skyplane are encrypted end-to-end. This means that all of the data chunks in each transfer are encrypted in the source region, transferred over the network (including through any relay regions) in encrypted form, and decrypted only when they reach destination region. Within the source and destination regions data may be handled in plaintext. For example, chunks are decrypted at the destination gateways and are inserted into the destination object store. For stronger security, the application using Skyplane may choose to store data in the source object store in encrypted form, so that it remains encrypted even in the source and destination regions. To afford higher efficiency for these use cases, Skyplane allows its own encryption to be disabled, to avoid encrypting data twice. The keys used for Skyplaneâs end-to-end encryption are generated at the client and then communicated to the gateways over SSH.
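To make the chunk-level encryption model concrete, here is a minimal sketch using symmetric encryption from the cryptography package; it illustrates the idea rather than Skyplane's actual wire format:

from cryptography.fernet import Fernet

# key generated at the client and shared with the gateways (Skyplane communicates keys over SSH)
key = Fernet.generate_key()
cipher = Fernet(key)

chunk = b"example chunk of object data"           # plaintext chunk in the source region
encrypted_chunk = cipher.encrypt(chunk)            # what travels over the network and relay regions
decrypted_chunk = cipher.decrypt(encrypted_chunk)  # recovered only at the destination region
assert decrypted_chunk == chunk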
HTTP/REST calls made between gateways are encrypted separately, using TLS.
Owing to the above encryption mechanisms, Skyplane guarantees confidentiality against a passive adversary who can view data transferred over the wide-area network and in relay regions. Such an adversary cannot see the contents of the data, but it can potentially see the following:
The quantity of data transferred.
The network path and overlay path taken by each chunk during the transfer.
The size of each chunk (which may be related to the size of the files/objects being transferred).
The timing of each chunkâs transfer between gateways and over the network.
Firewalls#
Skyplane adopts best practices to ensure data and gateway nodes are secure during transfers. In this section, we briefly describe the design. Firewalls are enabled by default, and we advise you not to turn them off. This not only keeps data secure in flight but also prevents gateways from being compromised. Our approach of using a dedicated skyplane VPC and firewalls guarantees that your default networks remain untouched, and we have also architected it to allow for multiple simultaneous transfers! If you have any questions regarding the design and/or implementation, we encourage you to open an issue with [Firewall] in the title.
GCP#
Skyplane creates a global VPC called skyplane when it is invoked for the first time with a new project. Instances and firewall rules are applied to this VPC and do NOT interfere with the default GCP VPC. This ensures all the changes that Skyplane introduces are localized within the skyplane VPC: all instances and our firewall rules apply only within the skyplane VPC. The skyplane global VPC consists of skyplane subnets for each region.
During every Skyplane transfer, a new set of firewall rules is created that allows the IPs of all instances involved in the transfer to exchange data with each other. These rules are set with priority 1000 and are revoked after the transfer completes. All instances can be accessed via ssh on port 22, and respond to ICMP packets to aid debugging.
AWS#
While GCP VPCs are global, in AWS Skyplane creates a skyplane VPC and a security group (SG) in every region involved in a transfer. During transfers, firewall rules are instantiated that allow the IPs of all gateway instances involved in the transfer to relay data with each other. After the transfer, the firewall rules are deleted.
Azure#
Firewall support for Azure is in the roadmap.
Large Objects#
Skyplane breaks large objects into smaller sub-parts, also known as striping, to improve transfer parallelism (currently supported for AWS and GCP only).
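As a rough sketch of the idea (not Skyplane's actual chunking logic; the 64 MiB part size is just an example), a large object can be split into fixed-size byte ranges that are transferred in parallel and reassembled at the destination:

def stripe_object(object_size: int, part_size: int = 64 * 1024 * 1024):
    """Split an object of `object_size` bytes into (offset, length) byte ranges."""
    parts = []
    offset = 0
    while offset < object_size:
        length = min(part_size, object_size - offset)
        parts.append((offset, length))
        offset += length
    return parts

# e.g. a 1 GiB object striped into 64 MiB parts -> 16 byte ranges
print(stripe_object(1024 ** 3))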
Performance Stats Collection#
Skyplane collects performance statistics. Some of this data is shared with the Skyplane developers to improve the accuracy of the Skyplane solver and for research purposes.
Here are the guiding principles of our collection policy:
No surprises: You will be notified before we begin collecting data. You will be notified of any changes to the data being collected or how it is used.
Easy opt-out: You will be able to easily opt out of data collection.
Transparency: You will be able to review all data that is sent to us.
Control: You will have control over your data, and we will honor requests to delete your data.
We will not collect any personally identifiable data or proprietary code/data.
We will not sell data or buy data about you.
What data is collected#
We collect non-sensitive data that helps us understand how Skyplane is used. Personally identifiable data will never be collected. Please check the /tmp/skyplane/usage/{client_id}/{session_id}/usage_stats.json file to see the data we collect. Explicitly, we collect:
System and OS information (OS version, kernel version, Python version).
Anonymized client id and transfer session id.
Source region and destination region per transfer.
The collection of command arguments used in the transfer session.
Total runtime and the aggregated transfer speed in Gbps.
Error message if the transfer fails.
How to disable it#
There are two ways to disable performance stats collection:
Run skyplane config set usage_stats false to disable collection for all future transfers. This won't affect currently running transfers. Under the hood, this command writes {"usage_stats": false} to the global config file ~/.skyplane/config.json.
Set the environment variable SKYPLANE_USAGE_STATS_ENABLED to 0, which temporarily disables performance stats collection.
Currently there is no way to enable or disable collection for a running transfer; you have to stop and restart the transfer.
How does it work#
When you run a skyplane cp or skyplane sync command, Skyplane decides whether performance stats collection should be enabled by considering the following factors in order:
It checks whether the environment variable SKYPLANE_USAGE_STATS_ENABLED is set: 1 means enabled and 0 means disabled.
If the environment variable is not set, it reads the value of the key usage_stats in the global config file ~/.skyplane/config.json: true means enabled and false means disabled. If there is no such key in the global config file, performance stats collection is enabled by default.
Note: performance stats collection is enabled by default the first time you run skyplane init.
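The precedence described above can be sketched as follows (a simplified illustration, not Skyplane's exact source code; the config path and key names come from the description above):

import json
import os
from pathlib import Path

def usage_stats_enabled(config_path: Path = Path.home() / ".skyplane" / "config.json") -> bool:
    """The environment variable takes precedence over the global config file; the default is enabled."""
    env = os.environ.get("SKYPLANE_USAGE_STATS_ENABLED")
    if env is not None:
        return env == "1"
    if config_path.exists():
        config = json.loads(config_path.read_text())
        return bool(config.get("usage_stats", True))
    return True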
Requesting removal of collected data#
To request removal of collected data, please email us at admin@skyplane.org with the client_id that you can find in /tmp/skyplane/usage/{client_id}/{session_id}/usage_stats.json.
Frequently Asked Questions (FAQ)#
Does the client_id and session_id map to personal data?
No, the uuid is a random ID that cannot be used to identify a specific person or machine. It does not live beyond the lifetime of your Skyplane transfer session and is primarily captured to enable us to honor deletion requests.
The client_id and session_id are logged so that deletion requests can be honored.
Could an enterprise easily configure an additional endpoint or substitute a different endpoint?
We definitely see this use case and would love to chat with you to make this work; email admin@skyplane.org.
Contact us#
If you have any feedback regarding performance stats collection, please email us at admin@skyplane.org.
On-Prem Transfers#
Currently, Skyplane supports on-prem transfers from local disk, NFS, and HDFS to cloud object stores.
HDFS Setup#
Skyplane uses PyArrow and libhdfs for the HDFS connection.
Transferring from HDFS requires a prior Hadoop and Java installation.
Please refer to the PyArrow HDFS documentation for the necessary environment variable setup.
Note that the cluster needs to communicate with the Skyplane gateways. Please update the clusters' inbound firewall rules to allow traffic from Skyplane.
Resolving HDFS Datanodes#
A file called hostname is located in the ./skyplane/scripts/on_prem folder. This file is used to resolve each datanode's hostname or internal IP to an external IP.
For each datanode, copy its hostname or internal IP together with the corresponding external IP into the file.
After all the required information is added, each line of the hostname file should look like this:
<External IP> <Datanodes' Hostname or Internal IP>
Testing the Transfer#
Now you can test running skyplane cp to transfer from a local disk or an HDFS cluster to any cloud object store.
---> Copy from local disk
$ skyplane cp -r /path/to/local/file gs://...
---> Copy from HDFS
$ skyplane cp -r hdfs://... gs://...
FAQ#
I get InsufficientVCPUException when running a transfer#
Skyplane needs to provision VMs in each cloud to complete a transfer. The VMs read data from the source object store and send it along sockets to the destination. By default, Skyplane will provision 1 VM in each region where each VM has 32 vCPUs. If you do not have sufficient quota to provision these VMs, please see our guide to requesting cloud limit increases.
Alternatively, you can configure Skyplane to use fewer resources per transfer. You can reduce the default number of VMs used via max_instances and change the instance class for gateways by configuring aws_instance_class, azure_instance_class, or gcp_instance_class with a smaller instance class.
Does Skyplane support local transfers?#
Local-to-cloud transfers are currently experimental (see the On-Prem Transfers section for setup details); cloud-to-cloud transfers are fully supported.
How does Skyplane map object keys from source to destination?#
In the non-recursive case, Skyplane extracts the key from the full bucket path (s3://[bucket_name]/[key]) and places the object at dest_prefix/[key] if the provided dest_prefix is a directory (ends in /), or else replaces the object at dest_prefix with the extracted object.
In the recursive case, Skyplane appends a trailing slash to the source and dest paths if one does not already exist. After extracting the key from the source path, it is appended to the dest prefix to get object keys.
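The mapping described above can be sketched roughly as follows (an illustration of the described behavior, not Skyplane's actual implementation; the helper names are hypothetical):

def map_dest_key_nonrecursive(src_path: str, dest_prefix: str) -> str:
    """Non-recursive: extract [key] from s3://[bucket_name]/[key] and map it under dest_prefix."""
    key = src_path.split("://", 1)[1].split("/", 1)[1]
    # place under the prefix if it is a directory (ends in "/"), else replace the prefix
    return dest_prefix + key if dest_prefix.endswith("/") else dest_prefix

def map_dest_key_recursive(src_key: str, src_prefix: str, dest_prefix: str) -> str:
    """Recursive: append trailing slashes to the prefixes, then append the key relative to the source prefix."""
    if not src_prefix.endswith("/"):
        src_prefix += "/"
    if not dest_prefix.endswith("/"):
        dest_prefix += "/"
    relative_key = src_key[len(src_prefix):]
    return dest_prefix + relative_key

print(map_dest_key_nonrecursive("s3://bucket-src/data/file.txt", "backup/"))  # backup/data/file.txt
print(map_dest_key_recursive("data/2024/file.txt", "data", "backup"))         # backup/2024/file.txt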
Troubleshooting MissingObjectException#
This exception is raised when:
no objects are found at the source prefix passed to the CLI
the source prefix is a directory and the --recursive flag is not set
there is a mismatch between the source prefix and the key for an object Skyplane is copying
To troubleshoot this exception, carefully check that the requested path is not empty and is accessible via the credential used by Skyplane. If this is the case, ensure that the --recursive flag is set if the source prefix is a directory.
As an example, to transfer the directory s3://some-bucket/some-directory/ to s3://some-bucket/destination/, you would run skyplane cp --recursive s3://some-bucket/some-directory/ s3://some-bucket/destination/.
Troubleshooting MissingBucketException#
This exception is raised when:
the source bucket does not exist
the destination bucket does not exist
the source bucket is not accessible via the credential used by Skyplane
the destination bucket is not accessible via the credential used by Skyplane
Using the cloud provider's console, verify that the bucket exists. If so, ensure that Skyplane has access to the bucket.
Note
Requester pays buckets: if it is a public bucket, it may be a requester pays bucket. Any egress fees from this bucket will be paid by the requester (i.e. Skyplane) instead of the bucket's owner. By default, Skyplane disables support for requester pays buckets to avoid unexpected egress charges.
To enable support for requester pays buckets, run skyplane config set requester_pays true.
How can I switch between GCP projects?#
We recommend resetting your GCP credentials locally by running rm -r ~/.config/gcloud and then re-running gcloud auth application-default login. You can then set the project ID you want with gcloud config set project <PROJECT_ID>. Once you've updated authentication and the project, you can run skyplane init --reinit-gcp.
If you get an error saying Compute Engine API has not been used in project 507282715251 before or it is disabled, wait a few minutes for the API enablement to take effect and re-run skyplane init.
Faster Training Data Loading#
This tutorial explains how you can use the Skyplane API to quickly download data from an object store located in a different region or cloud than your training instance. See full workflow here.
Large-scale machine learning (ML) training typically includes a step for acquiring training data. The following example illustrates an ML workflow where the original ImageNet data is stored in an S3 bucket in the US-East-1 region.
Remote vs. Local Regions#
Say that you have a VM for running training jobs in an AWS region, us-west-2. Reading data from a same-region S3 bucket will be very fast and free. However, if your data is in another region or cloud provider, reading the data will be much slower and will also incur per-GB egress fees. In this tutorial, we assume that our data is in a bucket in us-east-1 (the remote region), but we are running training from another region, us-west-2 (the local region).
Reading data from S3#
Directly reading data from S3 can be convenient to avoid having to download your entire dataset before starting to train. In this tutorial, we create an ImageNetS3 dataset that extends AWS's S3IterableDataset object.
import io

import skyplane
import torch
import torchvision.transforms as transforms
from PIL import Image
from torch.utils.data import IterableDataset, DataLoader
from awsio.python.lib.io.s3.s3dataset import S3IterableDataset


class ImageNetS3(IterableDataset):
    def __init__(self, url_list, shuffle_urls=True):
        self.s3_iter_dataset = S3IterableDataset(url_list, shuffle_urls)
        self.transform = transforms.Compose(
            [
                transforms.RandomResizedCrop(224),
                transforms.RandomHorizontalFlip(),
                transforms.ToTensor(),
                transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225)),
            ]
        )

    def data_generator(self):
        try:
            while True:
                # Based on the alphabetical order of files, the label is fetched before the image.
                # e.g. for files 0186304.cls 0186304.jpg, 0186304.cls will be fetched first
                _, label_fobj = next(self.s3_iter_dataset_iterator)
                _, image_fobj = next(self.s3_iter_dataset_iterator)
                label = int(label_fobj)
                image_np = Image.open(io.BytesIO(image_fobj)).convert("RGB")
                # Apply torchvision transforms if provided
                if self.transform is not None:
                    image_np = self.transform(image_np)
                yield image_np, label
        except StopIteration:
            return

    def __iter__(self):
        self.s3_iter_dataset_iterator = iter(self.s3_iter_dataset)
        return self.data_generator()
We can create a data loader with the data located in our remote bucket:
remote_bucket_url = "s3://us-east-1-bucket"
data_urls = [
(remote_bucket_url + "/" if not remote_bucket_url.endswith("/") else remote_bucket_url) + f"imagenet-train-{i:06d}.tar"
for i in range(100)
]
dataset = ImageNetS3(data_urls)
train_loader = DataLoader(dataset, batch_size=256, num_workers=2)
However, the latency of this dataloader will be very high and likely degrade training performance.
Transferring Data with Skyplane#
We can improve our data loader's performance by transferring the data to a local region first. We can do this by running:
local_bucket_url = "s3://us-west-2-bucket"
# Step 1: Create a Skyplane API client. It will read your AWS credentials from the AWS CLI by default
client = skyplane.SkyplaneClient(aws_config=skyplane.AWSConfig())
# Step 2: Copy the data from the remote bucket to the local bucket.
client.copy(src=remote_bucket_url, dst=local_bucket_url, recursive=True)
Once the copy completes, the following code will be able to read the training data from the bucket with low latency, and no egress cost:
data_urls = [
(local_bucket_url + "/" if not local_bucket_url.endswith("/") else local_bucket_url) + f"imagenet-train-{i:06d}.tar"
for i in range(100)
]
dataset = ImageNetS3(data_urls)
train_loader = DataLoader(dataset, batch_size=256, num_workers=2)
Creating an Airflow Operator#
Skyplane can be easily incorporated into an Airflow DAG using a SkyplaneOperator, which can be utilized in data transfer tasks, such as replacing the S3toGCSOperator. The following example demonstrates a data analytics workflow where data is transferred from S3 to GCS to build a BigQuery dataset and then used in a PySpark data analysis job.
In this tutorial, we extend Airflow's BaseOperator object to create a custom Skyplane operator, called SkyplaneOperator. We first define the fields of the SkyplaneOperator:
from typing import Optional

import skyplane
from airflow.models import BaseOperator  # type: ignore


class SkyplaneOperator(BaseOperator):
    template_fields = (
        "src_provider",
        "src_bucket",
        "src_region",
        "dst_provider",
        "dst_bucket",
        "dst_region",
        "config_path",
    )

    def __init__(
        self,
        *,
        src_provider: str,
        src_bucket: str,
        src_region: str,
        dst_provider: str,
        dst_bucket: str,
        dst_region: str,
        aws_config: Optional[skyplane.AWSConfig] = None,
        gcp_config: Optional[skyplane.GCPConfig] = None,
        azure_config: Optional[skyplane.AzureConfig] = None,
        **kwargs,
    ) -> None:
        super().__init__(**kwargs)
        self.src_provider = src_provider
        self.src_bucket = src_bucket
        self.src_region = src_region
        self.dst_provider = dst_provider
        self.dst_bucket = dst_bucket
        self.dst_region = dst_region
        self.aws_config = aws_config
        self.gcp_config = gcp_config
        self.azure_config = azure_config

    def execute(self, context):
        pass
Inside the execute function, we can instantiate a Skyplane client to create a dataplane and execute transfers:
def execute(self, context):
    client = skyplane.SkyplaneClient(aws_config=self.aws_config, gcp_config=self.gcp_config, azure_config=self.azure_config)
    dp = client.dataplane(self.src_provider, self.src_region, self.dst_provider, self.dst_region, n_vms=1)
    with dp.auto_deprovision():
        dp.provision()
        dp.queue_copy(self.src_bucket, self.dst_bucket, recursive=True)
        tracker = dp.run_async()
We can also add reporting on the transfer:
import time

with dp.auto_deprovision():
    ...
    print("Waiting for transfer to complete...")
    while True:
        bytes_remaining = tracker.query_bytes_remaining()
        if bytes_remaining is None:
            print("Transfer not yet started")
        elif bytes_remaining > 0:
            print(f"{(bytes_remaining / (2 ** 30)):.2f}GB left")
        else:
            break
        time.sleep(1)
    tracker.join()
    print("Transfer complete!")
Building from source#
Prerequisite: install development dependencies#
For Skyplane development, we use a variety of tools to check for code quality and to build documentation. To install these tools, run the following command:
pip install -r requirements-dev.txt
Setting up a developer environment#
Skyplane is composed of the client (which runs locally on a user's laptop) and gateway VMs (which run in the respective clouds). Normally, the gateways use a pre-built nightly Docker image containing the latest build of the Skyplane gateway code (public.ecr.aws/s6m1p0n8/skyplane:edge). However, if you modify the gateway source (under skyplane/gateway), you will need to rebuild the gateway Docker image and push it to a container registry.
Ensure you have authenticated your GitHub account with Docker:
Install Docker:
curl -fsSL https://get.docker.com -o get-docker.sh && sh get-docker.sh
Create a Personal Access Token with "write:packages" permissions.
Register the token with Docker:
echo <PERSONAL_ACCESS_TOKEN> | sudo docker login ghcr.io -u <GITHUB_USERNAME> --password-stdin
Building and testing Skyplane#
Building and pushing a Skyplane Docker image#
To package the code into a Docker image and push it to a container registry on your account, run the following command (substitute your GitHub username for <YOUR_GITHUB_USERNAME_HERE>):
export SKYPLANE_DOCKER_IMAGE=$(bash scripts/pack_docker.sh <YOUR_GITHUB_USERNAME_HERE>)
This will build the Skyplane Docker image for the gateway and push it to ghcr.io under your user account. When running a Skyplane transfer, any provisioned gateways will pull the image specified by the SKYPLANE_DOCKER_IMAGE environment variable to ensure a reproducible environment.
First time setup: make sure the ghcr image is "public"#
By default, new packages on ghcr are private. To make the package public so gateways can download the image, convert the package to public (you only need to do this once):
Navigate to your newly created package on GitHub at https://github.com/users/<YOUR_GITHUB_USERNAME_HERE>/packages/container/package/skyplane. Make sure to substitute your GitHub username for <YOUR_GITHUB_USERNAME_HERE>.
Click the "Package settings" button (with the gear icon) at the top right of the Skyplane package page.
Click the "Make public" button.
Building the Skyplane client#
We use Poetry to manage package dependencies during development. For convenience, we provide a Poetry wrapper via setup.py. To build the client, install the Skyplane package in development mode. The package points to your currently checked-out version of the code, and any edits to the Skyplane client will immediately apply to the skyplane CLI command.
pip install -e ".[aws,azure,gcp]"
Testing a transfer#
We can run the skyplane CLI to test a transfer. The CLI will read your SKYPLANE_DOCKER_IMAGE environment variable and use that Docker image when launching gateways.
skyplane init
skyplane cp s3://... s3://...
Development Tips#
If testing transfers repeatedly, we recommend using the --reuse-gateways flag to reduce setup time.
Skyplane has debugging tools to ssh into gateway instances, view detailed transfer logs, and query chunk states during transfers. See Debugging Tools for more.
Contributing to Skyplane#
Welcome to Skyplane! Everyone is welcome to contribute. We are always looking for new features and improvements, and we value everyone's input.
Ways to Contribute#
There are many ways to contribute to Skyplane:
Answering questions on Skyplane's discussions page
Improving Skyplane's documentation
Filing bug reports or reporting sharp edges via GitHub issues
Contributing to our codebase.
Code Contributions#
We welcome pull requests, in particular for those issues marked with good first issue.
For other proposals or larger features, we ask that you open a new GitHub Issue or Discussion. We especially encourage external contributions to support additional cloud providers and object store endpoints.
To see more on how to setup a development environment, see the Development Guide.
Submitting pull requests#
Before you submit a pull request, make sure to complete the following steps:
Fork the Skyplane repository to create a copy of the project in your own account.
Set up a developer environment as described in the Development Guide.
Create a development branch (git checkout -b feature_name).
Test your changes manually using skyplane cp and with the unit test suite:
$ pytest tests/unit_nocloud/
Ensure your code is autoformatted and passes type checks:
$ pip install -r requirements-dev.txt
$ black -l 140 .
$ pytype --config .pytype.cfg skyplane
$ autoflake --in-place --remove-all-unused-imports --remove-unused-variables --recursive skyplane
If you updated documentation, test the docs:
$ pip install sphinx-autobuild
$ cd docs
$ pip install -r requirements.txt
$ sphinx-autobuild -b html . /tmp/docs_build
Commit your changes using a descriptive commit message.
Create a pull request on the main Skyplane repo from your fork. Consult Github Help for more details.
Debugging Tools#
Skyplane has built-in tools for debugging during development.
Logs#
Client Logs#
When you run a transfer, the client logs will be written to a folder inside the /tmp/skyplane/transfer_logs/ directory.
Gateway Logs#
Inside the client.log file, for each provisioned Skyplane gateway there will be lines like:
[INFO] Log viewer: http://127.0.0.1:[PORT]/container/[CONTAINER_ID]
[INFO] API: http://127.0.0.1:[PORT]
which correspond to the gateway log viewer and the gateway chunk API. You can view the logs for each gateway by going to the "Log viewer" address for that gateway.
Chunk API#
The gateway chunk API allows the status of chunks on each gateway to be queried externally and is used by the Skyplane client to monitor transfers. The available endpoints are listed below, followed by an example query.
* GET /api/v1/status - returns status of API
* GET /api/v1/servers - returns list of running servers
* GET /api/v1/chunk_requests - returns list of chunk requests (use {'state': '<state>'} to filter)
* GET /api/v1/chunk_requests/<int:chunk_id> - returns chunk request
* POST /api/v1/chunk_requests - adds a new chunk request
* PUT /api/v1/chunk_requests/<int:chunk_id> - updates chunk request
* GET /api/v1/chunk_status_log - returns list of chunk status log entries
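For example, you could query a gateway's chunk API from Python as follows (the port is a hypothetical example; substitute the one printed in client.log):

import requests

# Port comes from the "API: http://127.0.0.1:[PORT]" line in client.log (8081 is a hypothetical example)
gateway_api = "http://127.0.0.1:8081"

print(requests.get(f"{gateway_api}/api/v1/status").json())          # API status
print(requests.get(f"{gateway_api}/api/v1/servers").json())         # running servers
print(requests.get(f"{gateway_api}/api/v1/chunk_requests").json())  # chunk requests on this gateway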
Gateway ssh#
You can ssh into a gateway with:
skyplane ssh
which will list the available gateways that you can select to ssh into.
Once you've SSHed into a gateway instance, you can interact with the Skyplane Docker container running on it.
API documentation#
Welcome to Skyplane API!#
Skyplane's ease of use and efficiency make it a popular choice among developers for handling data transfers across clouds. With its API, developers can easily integrate Skyplane into their project pipelines (e.g. the ML training example) and use the same functionality as the command line interface (CLI), such as copying and syncing, along with additional features exclusive to the API. This opens up new opportunities for the growth and application of Skyplane in the future.
Note
This API is still experimental, and your feedback is much appreciated in improving it!

Installing Skyplane#
We have provided a detailed tutorial here.
How to import Skyplane#
To access Skyplane and its functions, you can import it in your Python code like this:
import skyplane
Really easy!
How to launch a Skyplane transfer job#
To start, you need to create a Skyplane client that reads your cloud keys. Every transfer job is managed by this client or by a dataplane configured by the client.
Skyplane then provides two different ways to transfer: (1) a simple copy, which takes care of VM provisioning and deprovisioning so you can launch a transfer with one line; and (2) a dataplane session, which gives you more freedom to launch multiple transfer jobs together and asynchronously.
A dataplane calculates the optimal transfer topology between the source and destination regions and can be reused to launch jobs later.
Below are the two examples using the methods mentioned above.
import skyplane
client = skyplane.SkyplaneClient(aws_config=skyplane.AWSConfig())
print(f"Log dir: {client.log_dir}/client.log")
client.copy(src="s3://skycamp-demo-src/synset_labels.txt", dst="s3://skycamp-demo-us-east-2/imagenet-bucket/synset_labels.txt", recursive=False)
# from https://github.com/skyplane-project/skyplane/blob/main/examples/api_demo.py
import skyplane
client = skyplane.SkyplaneClient(aws_config=skyplane.AWSConfig())
print(f"Log dir: {client.log_dir}/client.log")
dp = client.dataplane("aws", "us-east-1", "aws", "us-east-2", n_vms=1)
with dp.auto_deprovision():
dp.provision()
dp.queue_copy(
"s3://skycamp-demo-src/synset_labels.txt", "s3://skycamp-demo-us-east-2/imagenet-bucket/synset_labels.txt", recursive=False
)
tracker = dp.run_async()
# You can monitor the transfer by querying tracker.query_bytes_remaining()
Integrations with other applications#
ImageNet ML training example#
See full workflow here.
Large-scale machine learning (ML) training typically includes a step for acquiring training data. The following example illustrates an ML workflow where the original ImageNet data is stored in an S3 bucket in the US-East-1 region.
In many cases, datasets and virtual machines (VMs) are located in different regions. This can lead to slow data transfer speeds and high data egress fees when using cloud provider tools, such as aws s3 cp, to download data to the VM's local disk. Skyplane offers a faster and more cost-effective way to transfer the dataset to an S3 bucket in the same region as the VM (e.g. us-west-2), with direct streaming of the data to the model without the need to download it to a local folder.

This process is as simple as adding just two lines of code, similar to the demonstration of the Skyplane simple copy.
# Step 1: Create a Skyplane API client. It will read your AWS credentials from the AWS CLI by default
client = skyplane.SkyplaneClient(aws_config=skyplane.AWSConfig())
# Step 2: Copy the data from the remote bucket to the local bucket.
client.copy(src=args.remote_s3_path, dst=args.local_s3_path, recursive=True)
Airflow SkyplaneOperator#
Skyplane can be easily incorporated into an Airflow DAG using a SkyplaneOperator, which can be utilized in data transfer tasks, such as replacing the S3toGCSOperator. The following example demonstrates a data analytics workflow where data is transferred from S3 to GCS to build a BigQuery dataset and then used in a PySpark data analysis job.

Contents#
CLI documentation#
skyplane cp documentation
skyplane#
skyplane [OPTIONS] COMMAND [ARGS]...
Options
- --install-completion <install_completion>#
Install completion for the specified shell.
- Options:
bash | zsh | fish | powershell | pwsh
- --show-completion <show_completion>#
Show completion for the specified shell, to copy it or customize the installation.
- Options:
bash | zsh | fish | powershell | pwsh
init#
Initialize the Skyplane CLI with your cloud credentials
skyplane init [OPTIONS]
Options
- -y, --non-interactive#
Run non-interactively
- Default:
False
- --reinit-azure, --no-reinit-azure#
- Default:
False
- --reinit-gcp, --no-reinit-gcp#
- Default:
False
- --reinit-ibm, --no-reinit-ibm#
- Default:
False
- --reinit-cloudflare, --no-reinit-cloudflare#
- Default:
False
- --disable-config-aws, --no-disable-config-aws#
- Default:
False
- --disable-config-azure, --no-disable-config-azure#
- Default:
False
- --disable-config-gcp, --no-disable-config-gcp#
- Default:
False
- --disable-config-ibm, --no-disable-config-ibm#
- Default:
True
- --disable-config-cloudflare, --no-disable-config-cloudflare#
- Default:
False
cp#
Copy files between any two cloud object stores
skyplane cp [OPTIONS] SRC DST
Options
- -r, --recursive#
If true, will copy objects at folder prefix recursively
- Default:
False
- --debug, --no-debug#
If true, will write debug information to debug directory.
- Default:
False
- --multipart, --no-multipart#
If true, will use multipart uploads.
- Default:
True
- -y, -f, --confirm#
Confirm all transfer prompts
- Default:
False
- -n, --max-instances <max_instances>#
Number of gateways
- Default:
1
- --max-connections <max_connections>#
Number of connections per gateway
- Default:
32
- --solver <solver>#
Solver to use for transfer
- Default:
direct
- -t, --tput <solver_required_throughput_gbits>#
Required throughput to be solved for in Gbps
- Default:
1
Arguments
- SRC#
Required argument
- DST#
Required argument
sync#
Sync files between any two cloud object stores
skyplane sync [OPTIONS] SRC DST
Options
- --debug, --no-debug#
If true, will write debug information to debug directory.
- Default:
False
- --multipart, --no-multipart#
If true, will use multipart uploads.
- Default:
True
- -y, -f, --confirm#
Confirm all transfer prompts
- Default:
False
- -n, --max-instances <max_instances>#
Number of gateways
- Default:
1
- --max-connections <max_connections>#
Number of connections per gateway
- Default:
32
- --solver <solver>#
Solver to use for transfer
- Default:
direct
- -t, --tput <solver_required_throughput_gbits>#
Required throughput to be solved for
- Default:
1
Arguments
- SRC#
Required argument
- DST#
Required argument
ssh#
SSH into a running gateway.
skyplane ssh [OPTIONS]
deprovision#
Deprovision all resources created by skyplane.
skyplane deprovision [OPTIONS]
Options
- -a, --all#
Deprovision all resources including networks.
- Default:
False
- --filter-client-id <filter_client_id>#
Only deprovision instances with this client ID under the instance tag.
Benchmarks#
This page compares the performance of Skyplane and cloud data transfer tools such as AWS DataSync and Google Cloud Data Transfer. Achieved results vary depending on the region tested, the size of the file, and the number of VMs used.
Benchmark: Large file benchmark#
In this benchmark, we measure the impact of striping on large file transfers. We transfer a single large file between two AWS regions. Overall, Skyplane is 113.4x faster than AWS DataSync for the largest tested file.
Benchmark setup:
Source: AWS ap-southeast-2 (Sydney)
Destination: AWS eu-west-3 (Paris)
Number of files: 1x
File size: {4, 8, 16, 32, 64}GB
Number of Skyplane VMs: 8
Benchmark: Cost comparison w/ compression#
In this benchmark, we measure the impact of Skyplane's compression algorithm, which saves a large fraction of the data egress fees. In addition, Skyplane does not charge any service fee. Overall, Skyplane is 6.2x cheaper than AWS DataSync when transferring an uncompressed 220GB dump of English Wikipedia.
Benchmark setup:
Source: AWS us-east-1 (North Virginia)
Destination: AWS us-west-2 (Oregon)
Number of files: ~14,000x (sharded to 16 MB chunks)
File size: 223.67 GB
Number of Skyplane VMs: 8
Benchmark: ImageNet transfer on AWS#
In this benchmark, we measure the transfer speed and universal support of the Skyplane API. For transferring a 70 GB fake ImageNet dataset, Skyplane supports transfers across AWS, GCP, and Azure and completes the transfer in around 25 seconds for the selected region pairs below. AWS DataSync, by contrast, only supports transfers into and out of AWS services, and it is slower.
Benchmark setup:
Source: labeled on axis
Destination: labeled on axis
Number of files: 1,152x (training and validation batches)
File size: 70 GB
Number of Skyplane VMs: 8
Benchmark: ImageNet transfer on GCP#
In this benchmark, we measure the transfer speed and universal support of the Skyplane API. For transferring a 70 GB fake ImageNet dataset, Skyplane supports transfers across AWS, GCP, and Azure and completes the transfer in around 60 seconds for the selected region pairs below. GCP Data Transfer, by contrast, only supports transfers into GCP services, and it is slower.
Benchmark setup:
Source: labeled on axis
Destination: labeled on axis
Number of files: 1,152x (training and validation batches)
File size: 70 GB
Number of Skyplane VMs: 8
Benchmark: Impact of Chunk Sizes on Transfer time and Throughput#
In this benchmark, we measure the transfer time and throughput of a transfer of 100 files of roughly 140 MB each. The transfer was done from an S3 bucket in us-west-1 to another S3 bucket in us-east-1.
| Chunk Size | Time | Throughput |
|---|---|---|
| 5 MiB | 81.91 s | 1.36 Gbps |
| 8 MiB | 34.64 s | 3.23 Gbps |
| 16 MiB | 19.47 s | 5.74 Gbps |
| 32 MiB | 19.89 s | 5.62 Gbps |
| 64 MiB | 16.17 s | 6.91 Gbps |
| 128 MiB | 18.20 s | 6.14 Gbps |
Instructions to replicate benchmarks#
To replicate select benchmarks, see the following guide:
Benchmark instructions#
Skyplane vs AWS DataSync small file#
Setup#
For this experiment, we choose fake_imagenet (from Google, here) as the test dataset and pick Sydney and Paris as the two endpoints of the transfer. The fake_imagenet dataset is about 70 GB and contains batches of training, validation, and test data.
Skyplane#
To time the entire process of the Skyplane transfer, we use the Linux time command. We add the -y flag, which auto-confirms prompts to avoid waiting for user input. We use 8 VMs per source/destination region. Below is an example of the command we run:
$ time skyplane cp -r -y s3://{bucket name in ap-southeast-2 region}/fake_imagenet/ s3://{bucket name in eu-west-3 region}/fake_imagenet/ -n 8
In the output, you will find:
✅ Transfer completed successfully
Transfer runtime: 28.11s, Throughput: 19.85Gbps
This is the result we include in the plot, where we compare only the transfer time (provisioning time, etc. excluded).
DataSync#
We use AWS DataSync for comparison. Instructions on how to set it up can be found here. Then, we can launch a task for the transfer. Below is a snapshot of such a task:
Skyplane vs GCP Data Transfer small file#
Setup#
For this experiment, we choose fake_imagenet (from Google, here) as the test dataset and pick different regions to transfer between. The fake_imagenet dataset is about 70 GB and contains batches of training, validation, and test data.
Skyplane#
To time the entire process of the Skyplane transfer, we use the Linux time command. We add the -y flag, which auto-confirms prompts to avoid waiting for user input. We use 8 VMs per source/destination region. Below is an example of the command we run:
$ time skyplane cp -r -y s3://{bucket name in source region}/fake_imagenet/ gs://{bucket name in destination region}/fake_imagenet/ -n 8
In the output, you will find:
✅ Transfer completed successfully
Transfer runtime: 28.73s, Throughput: 19.42Gbps
This is the result we include in the plot, where we compare only the transfer time (provisioning time, etc. excluded).
Data Transfer#
We use Google Cloud Data Transfer for comparison. Instructions on how to set it up can be found here. Then, we can launch a job for the transfer. Below is a snapshot of such a job:
Skyplane vs AWS DataSync for large files#
Setup#
For this experiment, we choose randomly generated large files of fixed sizes as the test data and pick Sydney and Paris as the two endpoints of the transfer. The random files have sizes of 4 GB, 8 GB, 16 GB, 32 GB, and 64 GB, each generated by the Linux command head -c {size}G </dev/urandom >myfile.
Skyplane#
To time the entire process of the Skyplane transfer, we use the Linux time command. We add the -y flag, which auto-confirms prompts to avoid waiting for user input. We use 8 VMs per source/destination region. Below is an example of the command we run:
$ time skyplane cp -r -y s3://{bucket name in ap-southeast-2 region}/random_file_64GB s3://{bucket name in eu-west-3 region}/random_file_64GB -n 8
In the output, you will find:
✅ Transfer completed successfully
Transfer runtime: 76.59s, Throughput: 6.69Gbps
This is the result we include in the plot, where we compare only the transfer time (provisioning time, etc. excluded).
DataSync#
We use AWS DataSync for comparison. Instructions on how to set it up can be found here. Then, we can launch a task for the transfer. Below is a snapshot of such a task:
Skyplane vs AWS DataSync vs Rsync for transfer cost#
Setup#
For this experiment, we choose an uncompressed 223GB dump of English Wikipedia (from here) as the test data and pick North Virginia and Oregon as the two endpoints of the transfer. The Wikipedia dump is first uncompressed to a tar file and then sharded into 16 MB chunks with the Linux command split -b.
Skyplane#
To time the entire process of the Skyplane transfer, we use the Linux time command. We add the -y flag, which auto-confirms prompts to avoid waiting for user input. We use 1 VM per source/destination region. Below is an example of the command we run:
$ time skyplane cp -r -y s3://{bucket name in us-east-1 region}/wiki_dumps s3://{bucket name in us-west-2 region}/wiki_dumps -n 1
In the output, you will find:
✅ Transfer completed successfully
Transfer runtime: 266.23s, Throughput: 6.26Gbps
This is the transfer time (provisioning time, etc. excluded) during which we pay for the AWS EC2 VMs.
You will also find:
Compression saved 78.94% of egress fees
This indicates the percentage of the data transfer (egress) fees saved by Skyplane's own compression algorithm.
To calculate the cost, we need two numbers: (1) the cost of transferring data out of N. Virginia to Oregon, which is $0.02 per GB; and (2) the cost of using on-demand VMs (of type m5.8xlarge) in N. Virginia and Oregon, which is $1.536 per hour. The values are found from AWS here.
The total cost for Skyplane transfer is
223.67 GB * (1 - 0.7894) * $0.02 / GB + 2 * $(1.536 / 3600) / s * 266.23 s = $1.17
DataSync#
We use AWS DataSync for comparison. Instructions on how to set it up can be found here. Then, we can launch a task for the transfer. Below is a snapshot of such a task:
To calculate the cost, we need one extra number: the cost of using the AWS DataSync service to copy data, which is $0.0125 per GB. It can be found here.
The total cost for AWS DataSync transfer is
223.67 GB * ($0.02 + $0.0125) / GB = $7.27
which is independent of the transfer speed.
Rsync#
We also use rsync for comparison. First, we need to mount the AWS S3 buckets on AWS EC2 VMs using the s3fs tool. An example command would be s3fs mybucket /path/to/mountpoint. Then, we can treat folders on the EC2 VMs as folders in the S3 buckets and use rsync to transfer. To do that, we use one VM in Oregon as the local client and treat the other VM in N. Virginia as the remote client.
In addition, rsync can compress the file data as it is sent to the destination. We do so with the rsync -z flag, which uses zlib to compress the data; at the default compression level, it saves 86% of the egress fees. The percentage is estimated with the following command:
pigz -6 -z -r -v wiki_dumps/
To time the entire process of the rsync transfer, we use the Linux time command. Finally, below is an example of the command we run:
time rsync -az -e "ssh -i .ssh/{key to access Virginia VM}.pem" ubuntu@Virginia_VM-IP-Address:~/us-east-1/ us-west-2/
In the output, you will find
real 396m46.729s
user 0m23.663s
sys 0m19.536s
The real time is how long we used our AWS EC2 VMs.
The total cost for rsync transfer is
223.67 GB * (1 - 0.8608) * $0.02 / GB + 2 * $(1.536 / 3600) / s * 23807 s = $20.94
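For reference, the three cost figures above can be reproduced from the numbers quoted in this section:

# Numbers quoted above: $0.02/GB egress, $1.536/hr per m5.8xlarge VM (two VMs),
# $0.0125/GB DataSync service fee, 223.67 GB transferred.
egress_per_gb = 0.02
vm_per_hour = 1.536
data_gb = 223.67

skyplane_cost = data_gb * (1 - 0.7894) * egress_per_gb + 2 * (vm_per_hour / 3600) * 266.23
datasync_cost = data_gb * (egress_per_gb + 0.0125)
rsync_cost = data_gb * (1 - 0.8608) * egress_per_gb + 2 * (vm_per_hour / 3600) * 23807  # 396m47s = 23807 s

print(f"Skyplane: ${skyplane_cost:.2f}")  # ~$1.17
print(f"DataSync: ${datasync_cost:.2f}")  # ~$7.27
print(f"rsync:    ${rsync_cost:.2f}")     # ~$20.94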
Roadmap#
Skyplane is an actively developed project. It will have SHARP EDGES. Please file an issue if you encounter bugs.
If you are interested in contributing, please see our contributor guide.
Skyplane 0.2.0 release (current)#
Firewalled VM-to-VM VPCs for Azure.
Large object transfer support for Azure.
Integration testing suite in CI.
Improvements to the scalability of transfers with many files (>10000).
Upcoming#
On-prem transfer support (design needed).
Broadcast support to multiple destinations.
Skyplane managed buckets for cross-cloud object storage.