Bulk Download of ArXiv's Repository
In this post, we demonstrate step by step how to download all papers from arXiv. arXiv is an open-access repository where researchers share their manuscripts, often before publication at a conference or in a journal.
The arXiv dataset is hosted on Amazon's S3 cloud storage, but downloading it is not free: the person downloading pays the transfer cost. As of the date of this post, the cost is around $50.
This post is a distillation of the following page.
Install python and s3cmd
First, install Python. To access Amazon's S3, we will use a forked version of s3cmd, which is available on GitHub and contains a patch for downloading the arXiv dataset; see also here.
Configure AWS to access S3
Next, you must configure AWS so that you can access S3. To do so, you need to generate an ACCESS_KEY and a SECRET_ACCESS_KEY. For more details, see AWS access key.
Then, type
s3cmd --configure
to configure s3cmd with your credentials.
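As a quick sanity check after configuring, the sketch below tests whether a configuration file contains both credential entries. It assumes that s3cmd stored them under the keys access_key and secret_key in ~/.s3cfg, which is the usual default location; the function name check_s3cfg is made up for illustration and is not part of s3cmd.

```shell
#!/bin/sh
# Hypothetical helper (not part of s3cmd): succeed only if the given
# config file defines both access_key and secret_key.
# Assumption: s3cmd writes its settings as "key = value" lines in ~/.s3cfg.
check_s3cfg() {
    cfg="${1:-$HOME/.s3cfg}"
    [ -f "$cfg" ] || { echo "missing config: $cfg" >&2; return 1; }
    for key in access_key secret_key; do
        # Require "key = value" with a non-empty value.
        grep -Eq "^[[:space:]]*${key}[[:space:]]*=[[:space:]]*[^[:space:]]" "$cfg" || {
            echo "no ${key} in $cfg" >&2
            return 1
        }
    done
    echo "credentials found in $cfg"
}
```

Calling check_s3cfg without arguments inspects ~/.s3cfg; pass a path to inspect a different file.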
Verify your setup
If the setup is correct, the following command should reply with DIR s3://arxiv/pdf/:
s3cmd ls --add-header="x-amz-request-payer: requester" s3://arxiv/pdf
Download all papers
To download the LaTeX source code of all papers, type
s3cmd get --recursive --add-header="x-amz-request-payer: requester" s3://arxiv/src/
Similarly, to download all PDFs, type
s3cmd get --recursive --add-header="x-amz-request-payer: requester" s3://arxiv/pdf/
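Since the full download takes a long time, it can help to fetch into a dedicated directory and to be able to resume after an interruption. The following is a minimal sketch, assuming that the installed s3cmd supports the --recursive and --skip-existing options as mainline s3cmd does (we have not verified this for the fork). The function name fetch_arxiv, the DRY_RUN variable, and the default directory names are hypothetical.

```shell
#!/bin/sh
# Hypothetical wrapper: fetch one arXiv prefix ("src" or "pdf") into a
# local directory. Set DRY_RUN=1 to print the command instead of running it.
# Assumption: the installed s3cmd accepts --recursive and --skip-existing.
fetch_arxiv() {
    prefix="$1"
    dest="${2:-arxiv-$1}"
    case "$prefix" in
        src|pdf) ;;
        *) echo "usage: fetch_arxiv src|pdf [dest_dir]" >&2; return 2 ;;
    esac
    # Build the command as the function's argument list.
    set -- s3cmd get --recursive --skip-existing \
        --add-header="x-amz-request-payer: requester" \
        "s3://arxiv/$prefix/" "$dest/"
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$@"
        return 0
    fi
    mkdir -p "$dest" && "$@"
}
```

For example, DRY_RUN=1 fetch_arxiv src shows the command that would be run, and fetch_arxiv pdf papers downloads the PDFs into the directory papers. Re-running the same call after an interruption should skip files that are already present.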
Estimate the download cost
To estimate the download cost, first check the current S3 download rate per GB. Then calculate the total size of the download by listing the files with s3cmd and summing their sizes (the third column of the listing, in bytes) as follows
s3cmd ls --add-header="x-amz-request-payer: requester" s3://arxiv/src/\* > all_files.txt
awk '{s += $3} END {print "sum is", s, "average is", s/NR}' all_files.txt
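To turn the byte count into an actual cost estimate, the sum has to be converted to GB and multiplied by the rate. The sketch below does this in one step. It assumes that each object line of the listing has the form DATE TIME SIZE URI, treats 1 GB as 2^30 bytes, and takes the price per GB as an argument that you look up yourself; the function name estimate_cost is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: estimate_cost LISTING RATE_PER_GB
# LISTING is the saved output of "s3cmd ls"; RATE_PER_GB is the current
# S3 download price per GB, supplied by the user (not looked up here).
estimate_cost() {
    awk -v rate="$2" '
        # Count only object lines (DATE TIME SIZE URI); this skips DIR lines.
        NF >= 4 && $3 ~ /^[0-9]+$/ { bytes += $3; files++ }
        END {
            gb = bytes / (1024 * 1024 * 1024)   # assumption: 1 GB = 2^30 bytes
            printf "files: %d\nsize: %.2f GB\ncost: %.2f USD\n", files, gb, gb * rate
        }
    ' "$1"
}
```

For example, estimate_cost all_files.txt 0.09 reports the number of files, their total size, and the cost at a rate of 0.09 USD per GB; the rate here is only a placeholder.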