Lab -- Publicly Accessible Datasets

By Dave Eargle

Open data science means that others should be able to rerun your code and replicate your results. The first hurdle is making your data publicly accessible. I’m ignoring data access rights and terms of use here and just talking about streamlining the access itself.

Goal: data access should be one-click for the user, which means the following are out:

  • the user mounting a shared google drive
  • the user downloading data from kaggle et al. and putting it in a certain working directory

Maybe-viable options include the following:

  • Find it in an already-publicly-accessible place
    • e.g., OpenML datasets can be direct-downloaded via requesting a special url
  • host the data on GitHub.
    • Caveat – 25 MB file-size limit (for files uploaded through the web interface).
  • host the data on a personal cloud storage provider – google drive, dropbox, etc.
    • Caveat – some trickery is required to get a direct access link.
  • host the data on a cloud computing platform, such as google cloud or aws s3.
    • Caveat – requires the code-creator to maintain a cloud computing account. But arguably that’s a fringe positive skill for a data scientist to demonstrate.

Let’s try all!

Already-hosted datasets, such as OpenML or UCI ML

You can use url ‘hacking’ (h@x0r-ing) to extract direct-download links from places where datasets are already hosted, as long as the download link does not require authentication. (You could still programmatically fetch links that require authentication, but that doesn’t work for open data science without sharing your username and password.)

Practice with an OpenML dataset

Let’s try with an OpenML dataset such as the spambase dataset.

  1. In a web browser, visit https://www.openml.org/d/44.

  2. Look at that dataset page’s url:

    https://www.openml.org/d/44

    We intuit that the dataset id is 44.

  3. On the dataset’s page, there’s a cute little cloud download icon with “CSV” under it towards the top right of the page. Hover over it to see the link address. The one for this dataset is the following:

    https://www.openml.org/data/get_csv/44/dataset_44_spambase.arff

    If you click that url, the csv should download.

    What is an .arff file, you ask? Well, it doesn’t matter, as long as it’s structured tabularly or csv-ily. If you can open the downloaded file in e.g. Excel, it will work for pd.read_csv and the like. So forget about it!

  4. Right-click the CSV icon and select “Copy link address” (at least in Chrome; all browsers have something similar).

    Paste this url into a new browser tab. If it downloads a csv to your browser, then tada! you have a direct-download link.

  5. Imagine you were interested in generalizing the url pattern. It’s something like:

    https://www.openml.org/data/get_csv/<dataset_id>/<some_specific_filename>

    It’s easy enough to generalize where to insert the dataset_id, but I’m nervous about the <some_specific_filename>. Let’s take a guess though and see if the url works without providing the specific filename – maybe the site’s api will default to providing some default csv for the given dataset. Edit the url to just be the following:

    https://www.openml.org/data/get_csv/44/

    And paste it into a browser. Yay, it works.

    Know that some web servers may be picky about url routing – for example, it might not work without a trailing slash. We don’t know what web server openML is using, but we can black-box test it. Try without a trailing slash:

    https://www.openml.org/data/get_csv/44

    It still works, at least for this site.

    What this means is that you could use the above url directly in your script via a call such as:

    import pandas as pd

    # read the spambase csv straight from OpenML’s direct-download url
    spam = pd.read_csv('https://www.openml.org/data/get_csv/44')
    
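    If you wanted to make the generalization explicit, you could wrap the pattern in a small helper. A minimal sketch, assuming the pattern holds for other dataset ids (the function name load_openml_csv is mine, not part of any library):

    import pandas as pd

    def load_openml_csv(dataset_id):
        """Read an OpenML dataset via its direct-download csv url."""
        return pd.read_csv(f'https://www.openml.org/data/get_csv/{dataset_id}')

    spam = load_openml_csv(44)  # the spambase dataset from above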

Note about sklearn.datasets

The sklearn.datasets module has several convenience functions for loading datasets, including fetch_openml, which uses the OpenML API to programmatically find the download url for a given dataset.

In general, if an API exists, it will be more stable for fetching files than url hacking will be. You should use APIs when stability matters.

We could use the sklearn.datasets.fetch_openml function to download the spambase dataset as follows:

>>> from sklearn.datasets import fetch_openml
>>> spam = fetch_openml(name='spambase')

The data is available under the data key:

>>> spam.data.head()
   word_freq_make  word_freq_address  word_freq_all  ...  capital_run_length_average  capital_run_length_longest  capital_run_length_total
0            0.00               0.64           0.64  ...                       3.756                        61.0                     278.0
1            0.21               0.28           0.50  ...                       5.114                       101.0                    1028.0
2            0.06               0.00           0.71  ...                       9.821                       485.0                    2259.0
3            0.00               0.00           0.00  ...                       3.537                        40.0                     191.0
4            0.00               0.00           0.00  ...                       3.537                        40.0                     191.0

[5 rows x 57 columns]

This particular dataset was converted to a pandas dataframe, since column names were available:

>>> type(spam.data)
pandas.core.frame.DataFrame

We can pin the version by first inspecting the version of the dataset that was downloaded:

# what version is this?
>>> spam.details['version']
1

And then by modifying our earlier code:

>>> spam = fetch_openml(name='spambase', version=1)

N.B.: sklearn.datasets also has a fetch_kddcup99 convenience function, which can load just 10 percent of the data via its percent10 parameter.
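For example, a quick sketch (percent10 defaults to True in current scikit-learn versions, so passing it explicitly just documents the intent):

>>> from sklearn.datasets import fetch_kddcup99
>>> kdd = fetch_kddcup99(percent10=True)  # load only the 10-percent sample of the dataset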

Hosting on your own

If your datasets are not already available publicly somewhere else, you can do one of the following.

Uploading small datasets to github

You can host your own datasets that are < 25 MB on github. For demonstration purposes, I committed the above spambase dataset to github. View it at https://github.com/deargle-classes/security-analytics-assignments/blob/main/datasets/dataset_44_spambase.csv.

I cannot use the above as a direct download link. It would download just what you see in your browser – an html wrapper around the dataset. I need just the dataset!

  1. Github provides a convenient ?raw=true url argument you can append to get a “raw” file, not wrapped in html.

    Append that to the above url, and click it, and see if you get the csv:

    https://github.com/deargle-classes/security-analytics-assignments/blob/main/datasets/dataset_44_spambase.csv?raw=true

    Hurray, yes you do.

  2. Note the resolved url in your address bar after you click the above link:

    https://raw.githubusercontent.com/deargle-classes/security-analytics-assignments/main/datasets/dataset_44_spambase.csv

    We infer that ?raw=true redirects us to a raw.githubusercontent.com path. We could alternatively build urls following that raw.githubusercontent.com pattern directly in our analytics script and get the same result.

  3. Also note that the html-wrapped view of the file has a “raw” button on it, above the dataset to the right. Hover over it to inspect where that button would take you:

    https://github.com/deargle-classes/security-analytics-assignments/raw/main/datasets/dataset_44_spambase.csv

    This url says /raw instead of /blob. Click the url and note that it redirects you to the above raw.githubusercontent.com link. You could infer a different workable /raw pattern from the above.

(Working as of 2/16/2021).
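Each of these urls can be used directly in a script. For example, a minimal sketch reading the raw.githubusercontent.com link from step 2 with pandas:

    import pandas as pd

    # the raw.githubusercontent.com link that the ?raw=true url resolved to above
    url = ('https://raw.githubusercontent.com/deargle-classes/'
           'security-analytics-assignments/main/datasets/dataset_44_spambase.csv')
    spam = pd.read_csv(url)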

Sharing larger datasets from Google Drive or Dropbox

You can manipulate sharing links from Google Drive or Dropbox to get direct download links that you can use in scripts.

Google Drive

Using the browser view at https://drive.google.com, I uploaded the spambase dataset to my personal google drive, in a folder I created called “datasets”.

Right-click the file and select “Get link.” In the popup window, change the access rights to be that “anyone with the link” can view. Copy the link to your clipboard. My link looks like this:

https://drive.google.com/file/d/19xCOJyKJ-VSCGL-cM2pwY3fCVw-yzMsM/view?usp=sharing

Our goal is to convert the above into a direct download link. This one is trickier than GitHub’s. When I visit the above link in my browser, I see a promising “download” button on the top right. But I don’t get a url when I hover over it. Ah, Google Drive must be using javascript magic!

When I clicked the download button, I noticed a new tab open and then close. Like a ninja, I copied the url out of the new tab before it closed, and I got this:

https://drive.google.com/u/0/uc?id=19xCOJyKJ-VSCGL-cM2pwY3fCVw-yzMsM&export=download

This url pattern seems harder to generalize from, but I notice that the id in the second url is also present in the first url. Taking a guess, I’m crossing my fingers that the pattern is as simple as the following for all files:

https://drive.google.com/u/0/uc?id=<id_pulled_from_first_url>&export=download

But I need to confirm this!

This SO post says that the url pattern does generalize, with the slightly simpler form as follows:

https://docs.google.com/uc?export=download&id=YourIndividualID
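Assuming that pattern holds, a minimal sketch using the file id pulled from the sharing link above:

    import pandas as pd

    # file id copied by hand out of the Google Drive sharing link above
    file_id = '19xCOJyKJ-VSCGL-cM2pwY3fCVw-yzMsM'
    spam = pd.read_csv(f'https://docs.google.com/uc?export=download&id={file_id}')

Note that for large files, Drive may return a virus-scan confirmation page instead of the file itself, so this simple approach may not work for big datasets.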

Dropbox

If you get a public “sharing link” for a file in Dropbox and examine the URL, you’ll notice that it looks like this:

https://www.dropbox.com/s/611argvbp5dyebw/HealthInfoBreaches.csv?dl=0

I took a guess and changed the dl=0 at the end to dl=1, and tada it direct-downloaded! Like this:

https://www.dropbox.com/s/611argvbp5dyebw/HealthInfoBreaches.csv?dl=1

Easy enough.
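In a script, the dl=1 link should work directly with pandas:

>>> import pandas as pd
>>> breaches = pd.read_csv('https://www.dropbox.com/s/611argvbp5dyebw/HealthInfoBreaches.csv?dl=1')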

AWS S3

You can create an AWS S3 “bucket” from which to host public datasets. S3 stands for “simple storage service.” The process to create a bucket is hardly “simple,” but it is one-time. Charges, if any, should be extremely minimal (less than 10 cents a month?).

  • Create an AWS account if you don’t have one already
  • Navigate to the S3 Service
  • I recommend creating a new bucket if you don’t have one already. Its name must be globally unique across all of S3. I created one using my username, called deargle-public-datasets.

    This bucket is intended for open data science replication, so it needs to allow public access: turn off “Block public access.” I left the other bucket defaults in place.

However, the above on its own will not make your files publicly accessible; it only makes it possible for them to be made publicly accessible.

  • On the bucket “Permissions” tab, scroll down to the “Bucket Policy” area. This is the area where you write policies using JSON.

  • Click “Edit,” and paste in the following policy, replacing the value for the “Resource” key with the ARN shown for your bucket immediately above the Edit pane. Leave the trailing /* at the end of your resource name.

    {
        "Version": "2008-10-17",
        "Statement": [
            {
                "Sid": "AllowPublicRead",
                "Effect": "Allow",
                "Principal": {
                    "AWS": "*"
                },
                "Action": "s3:GetObject",
                "Resource": "arn:aws:s3:::deargle-public-datasets/*"
            }
        ]
    }
    

    You will get a lot of very scary warnings about how your bucket is now public, public, public! This is desired for our open-data-science use case. Only upload datasets to this bucket that you want the world to have read access to.

Next:

  1. Upload your dataset to the bucket
  2. Navigate to the page for your newly-uploaded file by clicking on the filename.

On the file details page, you can find your file’s “Object URL.”

To test that anyone can access your dataset using this URL, copy-paste this into a browser window in which you are not logged in to AWS (for instance, into an incognito or private browsing window).

If you set the policy correctly, your file should download. This means that you can use this url in your analytics script.
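For example, a sketch assuming the bucket name above and a hypothetical object key (substitute the actual “Object URL” copied from your file’s details page):

    import pandas as pd

    # hypothetical object url -- replace with the "Object URL" shown on your file's details page
    url = 'https://deargle-public-datasets.s3.amazonaws.com/dataset_44_spambase.csv'
    spam = pd.read_csv(url)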

Google Cloud

Similar to AWS S3, Google Cloud also uses ‘buckets’ from which you can host public datasets. Relative to AWS, the process to share files from a GCP bucket is extremely simple.

  • Create a GCP account if you don’t have one already
  • Open your GCP web console (https://console.cloud.google.com/), and select a project.
    • Unlike AWS, where S3 buckets belong to accounts, buckets in GCP belong to projects
  • Proceed to Storage -> Browser in the navigation pane and create a new bucket.
    • I created one called deargle-public-datasets
    • For “Choose where to store your data,” I left the default.
    • For “Choose a default storage class for your data,” I left the default.
    • For “Choose how to control access to objects,” I chose Uniform.
  • Read and follow the GCP documentation for Making all objects in the bucket publicly readable. As of 2/20/2021, those instructions boil down to granting the allUsers principal the Storage Object Viewer role on the bucket, from the bucket’s “Permissions” tab.

  • Ignore all scary warnings about this bucket’s public-ness, past, present, and future
  • Click back to the “Objects” tab and upload your files to your new bucket
  • When the upload is complete, click on the file to view object details. You should see a “Public URL” for your dataset. Navigate to this url using a private-browsing window to ensure that your dataset is publicly-available.
  • Use this url with pd.read_csv(), as in the sketch below.
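The public url pattern is https://storage.googleapis.com/<bucket>/<object>. A sketch assuming the bucket above and a hypothetical object name (substitute the actual “Public URL” from your object’s details page):

    import pandas as pd

    # hypothetical public url -- replace with the "Public URL" shown on your object's details page
    url = 'https://storage.googleapis.com/deargle-public-datasets/dataset_44_spambase.csv'
    spam = pd.read_csv(url)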

Deliverable

Use this jupyter notebook template.

Follow the instructions in the template notebook to complete and submit your deliverable.