Galaxy 101: The first thing you should try

In this very simple example we will introduce you to the bare basics of Galaxy:

  • Getting data from UCSC
  • Performing simple data manipulation
  • Understanding Galaxy's History system
  • Creating and editing workflows
  • Applying workflows to your data

You can watch a step-by-step explanation of this entire tutorial here.

What are we trying to do?

Suppose you get the following question: "Mom (or Dad) ... Which coding exon has the highest number of single nucleotide polymorphisms on chromosome 22?".  You think to yourself "Wow! This is a simple question ... I know exactly where the data is (at UCSC) but how do I actually compute this?" The truth is, there is really no straightforward way of answering this question in a time frame comparable to the attention span of a 7-year-old. Well ... actually there is and it is called Galaxy. So let's try it...

0. Organizing your windows and setting up a Galaxy account

0.0. Getting your display sorted out

To get the most out of this tutorial, open two browser windows. One you already have (it is this page). To open the other, right-click this link and choose "Open in a New Window" (or something similar depending on your operating system and browser):

Then organize your windows as something like this (depending on the size of your monitor you may or may not be able to organize things this way, but you get the idea):

0.1. Setting up a Galaxy account

Go to the User link at the top of the Galaxy interface and choose Register (unless of course you already have an account):

Then enter your information and you're in!

1. Getting data from UCSC

1.0. Getting coding exons

The first thing we will do is obtain data from UCSC by clicking "Get Data -> UCSC Main":

You will see Galaxy's middle pane change to look like this:

Make sure that your settings are exactly the same as shown on the screen (in particular, position should be set to "chr22", output format should be set to "BED - browser extensible data", and "Galaxy" should be checked next to the Send output to option). Click get output and you will see the next screen:

Here, make sure Create one BED record per is set to "Coding Exons" and click Send Query to Galaxy. After this you will see your first History Item in Galaxy's right pane. It will go through gray (preparing) and yellow (running) states to become green:


1.1. Getting SNPs

Now it is time to obtain SNP data. This is done in almost exactly the same way. The first thing we will do is click "Get Data -> UCSC Main" again:

but now change group to "Variation and Repeats":

so that the whole page looks like this:

click get output and you should see this:

where you need to make sure that Whole Gene is selected ("Whole Gene" here really means "Whole Feature") and click Send Query to Galaxy. You will get your second item in the history:

Now we will rename the two history items to "Exons" and "SNPs" by clicking on the Pencil icon adjacent to each item. We will also rename the history to "Galaxy 101" (or whatever you want) by clicking on "Unnamed history" so that everything looks like this:

2. Finding Exons with the highest number of SNPs

2.0. Joining exons with SNPs

Let's remind ourselves that our objective is to find which exon contains the most SNPs. The first step in answering this question is joining exons with SNPs (a fancy word for printing exons and SNPs that overlap side by side). This is done using the "Operate on Genomic Intervals -> Join" tool:

make sure your exons are first and SNPs are second and click Execute. You will get the third history item:

which will contain the following data:

chr22 16258185 16258303 uc002zlh.1_cds_1_0_chr22_16258186_r 0 - chr22 16258278 16258279 rs2845178  0 +
chr22 16266928 16267095 uc002zlh.1_cds_2_0_chr22_16266929_r 0 - chr22 16267011 16267012 rs7290262  0 +
chr22 16266928 16267095 uc002zlh.1_cds_2_0_chr22_16266929_r 0 - chr22 16266963 16266964 rs10154680 0 +
chr22 16266928 16267095 uc002zlh.1_cds_2_0_chr22_16266929_r 0 - chr22 16267037 16267038 rs2818572  0 +
chr22 16266928 16267095 uc002zlh.1_cds_2_0_chr22_16266929_r 0 - chr22 16267031 16267032 rs7292200  0 +

Let's take a look at this dataset. The first six columns correspond to exons. The last six correspond to SNPs. You can see that the exon with ID uc002zlh.1_cds_2_0_chr22_16266929_r contains four SNPs with IDs rs7290262, rs10154680, rs2818572, and rs7292200.
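
To make the join step concrete, here is a minimal Python sketch of the same idea. It is not Galaxy's implementation: it assumes that "overlap" means sharing at least one base, that BED coordinates are half-open (start included, end excluded), and it uses a simple all-against-all comparison. The names `Bed`, `overlaps`, and `join` are made up for this illustration; the records are the ones shown above.

```python
from typing import NamedTuple


class Bed(NamedTuple):
    """One six-column BED record (illustrative helper, not a Galaxy type)."""
    chrom: str
    start: int  # first base of the interval, counted from 0
    end: int    # one past the last base of the interval
    name: str
    score: str
    strand: str


def overlaps(a: Bed, b: Bed) -> bool:
    """True if both intervals are on the same chromosome and share a base."""
    return a.chrom == b.chrom and a.start < b.end and b.start < a.end


def join(first, second):
    """Return one 12-column row for every overlapping (first, second) pair."""
    return [tuple(a) + tuple(b) for a in first for b in second if overlaps(a, b)]


exons = [
    Bed("chr22", 16258185, 16258303, "uc002zlh.1_cds_1_0_chr22_16258186_r", "0", "-"),
    Bed("chr22", 16266928, 16267095, "uc002zlh.1_cds_2_0_chr22_16266929_r", "0", "-"),
]
snps = [
    Bed("chr22", 16258278, 16258279, "rs2845178", "0", "+"),
    Bed("chr22", 16267011, 16267012, "rs7290262", "0", "+"),
    Bed("chr22", 16266963, 16266964, "rs10154680", "0", "+"),
    Bed("chr22", 16267037, 16267038, "rs2818572", "0", "+"),
    Bed("chr22", 16267031, 16267032, "rs7292200", "0", "+"),
]

joined = join(exons, snps)  # exons first, SNPs second
for row in joined:
    print("\t".join(str(field) for field in row))
```

Because exons are passed first, their six columns come first in every output row, which is why the order of the two inputs matters in the tool as well.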

2.1. Counting the number of SNPs per exon

We've seen that exon uc002zlh.1_cds_2_0_chr22_16266929_r is repeated four times in the above dataset. Thus we can compute the number of SNPs per exon by simply counting how many times each exon name is repeated. This can be easily done with the "Join, Subtract, and Group -> Group" tool:

choose column 4 by selecting "c4" in Group by column. Then click on Add new Operation and make sure the interface looks exactly as shown below:

click Execute. Your history will look like this:

If you look at the above image you will see that the result of grouping (dataset #4) contains two columns. The first contains the exon name while the second shows the number of times this name has been repeated in dataset #3.
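
The grouping step amounts to counting how often each value of column 4 occurs. A minimal Python sketch, using the five joined rows shown earlier (the helper name `count_by_column` is invented for this example and is not part of Galaxy):

```python
from collections import Counter

# The five joined rows from dataset #3, as whitespace-separated text.
joined_text = """\
chr22 16258185 16258303 uc002zlh.1_cds_1_0_chr22_16258186_r 0 - chr22 16258278 16258279 rs2845178 0 +
chr22 16266928 16267095 uc002zlh.1_cds_2_0_chr22_16266929_r 0 - chr22 16267011 16267012 rs7290262 0 +
chr22 16266928 16267095 uc002zlh.1_cds_2_0_chr22_16266929_r 0 - chr22 16266963 16266964 rs10154680 0 +
chr22 16266928 16267095 uc002zlh.1_cds_2_0_chr22_16266929_r 0 - chr22 16267037 16267038 rs2818572 0 +
chr22 16266928 16267095 uc002zlh.1_cds_2_0_chr22_16266929_r 0 - chr22 16267031 16267032 rs7292200 0 +
"""


def count_by_column(lines, column):
    """Count how many rows share each value in the given column (1-based, like "c4")."""
    return Counter(line.split()[column - 1] for line in lines if line.strip())


counts = count_by_column(joined_text.splitlines(), column=4)
for exon_name, snp_count in counts.items():
    print(exon_name, snp_count)
```

The result has the same two-column shape as dataset #4: an exon name and the number of rows in which it appeared.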

2.3. Sorting exons by SNP count

To see which exon has the highest number of SNPs we can simply sort dataset #4 on the second column in descending order. This is done with "Filter and Sort -> Sort":

This will generate the fifth history item:

and you can now see that the highest number of SNPs per exon is 67. 
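
For readers who think in code, the sort step could be sketched in Python as follows. The exon names and most of the counts are invented placeholders; only 67 and 41 are values mentioned in this tutorial. The function name `sort_by_count` is likewise made up.

```python
# Made-up (exon name, count) lines standing in for dataset #4.
# Only the counts 67 and 41 come from the tutorial; everything else is invented.
grouped_lines = ["exon_a\t3", "exon_b\t67", "exon_c\t41", "exon_d\t12", "exon_e\t9"]


def sort_by_count(lines):
    """Sort tab-separated (name, count) lines by count, largest first."""
    rows = [line.split("\t") for line in lines]
    # int() matters here: compared as text, "9" would sort ahead of "67".
    return sorted(rows, key=lambda row: int(row[1]), reverse=True)


ranked = sort_by_count(grouped_lines)
for name, count in ranked:
    print(name, count)
```

The point of the `int()` conversion is that the column has to be treated as a number, not as text, for a descending sort to put the largest count on top.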

2.4. Selecting top five

Now let's select the top five exons with the highest number of SNPs. For this we will use the "Text Manipulation -> Select First" tool:

Clicking Execute will produce the sixth history item that will contain just five lines:
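
In plain Python terms, selecting the first lines of an already sorted dataset could look like the sketch below. The rows are placeholders (only the counts 67 and 41 appear in this tutorial), and `select_first` is a name invented for the example.

```python
from itertools import islice


def select_first(lines, n):
    """Return the first n items of any iterable of lines, ignoring the rest."""
    return list(islice(lines, n))


# Placeholder rows, already sorted by count in descending order.
sorted_counts = [67, 60, 55, 48, 41, 30, 12]
ranked = [f"exon_{i}\t{count}" for i, count in enumerate(sorted_counts)]

top_five = select_first(ranked, 5)
for line in top_five:
    print(line)
```

This only gives the "top five" because the input was sorted first; applied to unsorted data it would simply return whichever five lines happen to come first.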

2.5. Recovering exon info and displaying data in genome browsers

Now we know that in this dataset the top five exons contain between 41 and 67 SNPs. But what else can we learn about them? To know more we need to get back the positional information (coordinates) of these exons. This information was lost at the grouping step, and now all we have is two columns. To get the coordinates back we will match the names of exons in dataset #6 (column 1) against the names of exons in the original dataset #1 (column 4). This can be done with the "Join, Subtract, and Group -> Compare two Queries" tool (note the settings of the tool in the middle pane):

this adds the seventh dataset to the history:
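
The matching step can be pictured with the following Python sketch. It assumes the tool keeps every row of the first dataset whose key column value also occurs in the key column of the second dataset; the function name `compare_two_queries` and its parameters are invented for illustration and are not Galaxy's API. The exon records are taken from the join example above.

```python
def compare_two_queries(first, second, first_col, second_col):
    """Keep rows of `first` whose first_col value occurs in second_col of `second`.

    Columns are 1-based, mirroring how the tool labels them (c1, c4, ...).
    """
    wanted = {row[second_col - 1] for row in second}
    return [row for row in first if row[first_col - 1] in wanted]


# Two rows of the original exon dataset (#1): six BED columns each.
exons = [
    ("chr22", 16258185, 16258303, "uc002zlh.1_cds_1_0_chr22_16258186_r", 0, "-"),
    ("chr22", 16266928, 16267095, "uc002zlh.1_cds_2_0_chr22_16266929_r", 0, "-"),
]
# A one-row stand-in for the "top exons" dataset (#6): name and SNP count.
top_exons = [("uc002zlh.1_cds_2_0_chr22_16266929_r", 4)]

recovered = compare_two_queries(exons, top_exons, first_col=4, second_col=1)
for row in recovered:
    print("\t".join(str(field) for field in row))
```

The output rows are full six-column exon records again, which is what makes it possible to display them in a genome browser.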

The best way to learn about these exons is to look at their genomic surroundings. There is really no better way to do this than using genome browsers. Because this analysis was performed on a "standard" human genome, you have two choices: UCSC Genome Browser and Ensembl:

For example, clicking on "display at UCSC main" will show this (to see your regions look at "User Track" on top of browser image):

3. Understanding histories

In Galaxy your analyses live in histories such as this one:

Histories can be very large, you can have as many histories as you want, and all history behavior is controlled by the Options button at the top of the History pane:

Many of the options here are self-explanatory. If you create a new history, your current history does not disappear. If you would like to list all of your histories, just choose Saved Histories and you will see a list of all your histories in the center pane:


4. Converting histories into workflows

One of the history options listed above is very special. It allows you to easily convert existing histories into analysis workflows. Why would you want to create a workflow out of a history? To redo the analysis with minimal clicking.
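
One way to picture a workflow is as a function: the five analysis steps are chained once, and the chain can then be applied to any pair of inputs. The sketch below is only an analogy written in Python, not how Galaxy stores or runs workflows. The function name `top_exons_by_feature_count` and all of the data are invented, and the overlap rule assumes half-open intervals that share at least one base.

```python
from collections import Counter


def top_exons_by_feature_count(exons, features, n=5):
    """Chain the tutorial's steps over (chrom, start, end, name, score, strand) tuples."""
    # Step 1: join exons with overlapping features (exons first).
    joined = [e + f for e in exons for f in features
              if e[0] == f[0] and e[1] < f[2] and f[1] < e[2]]
    # Step 2: group by exon name (column 4) and count rows.
    counts = Counter(row[3] for row in joined)
    # Steps 3 and 4: sort by count, descending, and keep the first n.
    top_names = {name for name, _ in counts.most_common(n)}
    # Step 5: recover the full exon records by matching names.
    return [e for e in exons if e[3] in top_names]


# Invented toy data, just to show the same "workflow" running on two inputs.
exons = [
    ("chr1", 100, 200, "exonA", 0, "+"),
    ("chr1", 300, 400, "exonB", 0, "+"),
    ("chr2", 100, 200, "exonC", 0, "-"),
]
snps = [
    ("chr1", 150, 151, "snp1", 0, "+"),
    ("chr1", 310, 311, "snp2", 0, "+"),
    ("chr1", 320, 321, "snp3", 0, "+"),
    ("chr2", 500, 501, "snp4", 0, "+"),
]
repeats = [
    ("chr2", 50, 150, "rep1", 0, "+"),
    ("chr2", 180, 260, "rep2", 0, "-"),
    ("chr1", 190, 310, "rep3", 0, "+"),
]

top_by_snps = top_exons_by_feature_count(exons, snps, n=1)
top_by_repeats = top_exons_by_feature_count(exons, repeats, n=1)
print(top_by_snps)
print(top_by_repeats)
```

Nothing in the function mentions SNPs specifically, so the second call reuses it unchanged with a different kind of feature.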

4.0. Extracting a workflow

Let's take a look at the history again:

You can see that this history contains all steps of our analysis. So by building this history we have actually built a complete record of our analysis, with Galaxy preserving all parameter settings applied at every step. Wouldn't it be nice to just convert this history into a workflow that we'll be able to execute again and again? This can be done by clicking on the Options button:

and selecting Extract Workflow option:

The center pane will change as shown below and you will be able to choose which steps to include/exclude and how to name the newly created workflow. In this case I named it "galaxy101":

Once you click Create Workflow you will get the following message: "Workflow 'galaxy101' created from current history". But where did it go? Click on the Workflow link at the top of the Galaxy interface and you will see a list of all workflows with "galaxy101" listed at the top:

4.1. Opening workflow editor

If you click on a triangle adjacent to the workflow's name you will see the following dialogue:

Click Edit and the workflow editor will launch. It will allow you to examine and change settings of this workflow as shown below. Note that the box corresponding to the "Select First" tool is selected (highlighted with the blue border) and you can see parameters of this tool on the right pane. This is how you can view and change parameters of all tools involved in the workflow.

4.2. Hiding intermediate steps

Among the many things you can do with workflows I will mention just one. When a workflow is executed, one is usually interested in the final product and not in the intermediate steps. These steps can be hidden by mousing over a small asterisk in the lower right corner of every tool box:

Yet there is a catch. In a newly created workflow all steps are hidden by default, and Galaxy's default behavior is that if all steps of a given workflow are hidden, then nothing gets hidden in the history. This may be counterintuitive, but it is done to decrease the amount of clicking if you do want to hide some steps. So in our case, if we want to hide all intermediate steps with the exception of the last one, we will click the asterisk in the last step of the workflow:

Once you do this the representation of the workflow in the bottom right corner of the editor will change, with the last step becoming orange. This means that this is the only step that will generate a dataset visible in the history:
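
The rule described in this section can be summarized in a few lines of Python. This is only a restatement of the behavior as explained here, not Galaxy code; `visible_outputs` and the step labels are names chosen for the illustration.

```python
def visible_outputs(steps, marked):
    """Return the steps whose datasets stay visible in the history after a run.

    `marked` holds the steps whose asterisk was clicked. If no step is marked,
    nothing is hidden, so every step stays visible.
    """
    kept = [step for step in steps if step in marked]
    return kept if kept else list(steps)


steps = ["Join", "Group", "Sort", "Select First", "Compare two Queries"]

print(visible_outputs(steps, marked=set()))                    # nothing marked
print(visible_outputs(steps, marked={"Compare two Queries"}))  # last step marked
```

With nothing marked all five steps show up in the history; marking only the last step leaves just the final dataset visible.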

4.3. Renaming inputs

Right now both inputs to the workflow look exactly the same. This is a problem, as it will be very confusing which input should be exons and which should be SNPs:

In the image above you will see that the top input dataset (the one with the blue border) connects to the Join tool first, so it must correspond to the exon data. If you click on this box (in the image above it is already selected, which is why it is outlined with the blue border) you will be able to rename the dataset in the right pane:

Then click on the second input dataset and rename it "Features" (this would make this workflow a bit more generic, which will be useful later in this tutorial):

4.4. Renaming outputs

Finally, let's rename the workflow's output. To do this, click on the last dataset ("Compare two Queries") and in the Edit Step Actions dialogue box select "Rename Dataset".

Click Create:

and call it something like "top 5 exons":

4.5. Save! It is important...

Now let's save the changes we've made by clicking Options (top of the center pane) and selecting Save:

5. Run workflow on whole genome data

Now that we have a workflow, let's do something grand like, for example, finding exons with the highest number of repetitive elements.

5.0. Create a new history

Before we start let's create a new history by clicking Options and selecting Create New:

5.1. Get Exons

Now let's get coding exons for the entire genome by going to "Get Data -> UCSC Main" and setting up parameters as shown below. Note that this time the region radio button is set to "genome":

Click get output and you will get the next page (if it looks different from the image below, go back and make sure output format is set to "BED - browser extensible data"):

Choose "Coding Exons" and click Send Query to Galaxy.

5.2. Get Repeats

Go again to "Get Data -> UCSC Main" and make sure the following settings are selected (in particular group = "Variation and Repeats" and track = "RepeatMasker"):

Click get output and you will get the next page (if it looks different from the image below, go back and make sure output format is set to "BED - browser extensible data"):

Select "Whole Gene" and click Send Query to Galaxy.

5.3. Start the Workflow

At this point you will have two items in your history: one with exons and one with repeats. These datasets are very large (especially repeats) and it will take some time for them to become green. Luckily you do not have to wait, as Galaxy will automatically start jobs once uploads have ended. So nothing stops us from starting the workflow we have created. First, click on the Workflow link at the top of the Galaxy interface, mouse over "galaxy101", and click on the arrow:

choose Run:

The center pane will change to let you launch the workflow. Select the appropriate dataset for each input as shown below (the exons dataset for the exon input and the repeats dataset for the "Features" input), scroll down, and click Run workflow.

Once the workflow has started you will initially be able to see all its steps. Note that you are joining 77,614 exons with 5,298,130 repeats, so naturally this will take some time:

5.4. Get coffee

As we mentioned above, this will take some time, so go get coffee. When you come back you will see this. Note that because all intermediate steps of the workflow were hidden, once it is finished you will only see the final dataset #7:


6. We did not fake this:

The two histories and the workflow described in this tutorial are accessible directly from this page. The histories are embedded below (click on the green plus icon to import):

The workflow is embedded here (click on the green plus icon to import):