The folks in the Glor lab have been doing a lot of AFLP work recently and using structure to analyze these data. To identify K (the number of genotypic clusters in a sample of individuals) we have been using a hierarchical approach proposed by Coulon and colleagues. We’ve largely been happy with this method, but it requires that you run a lot of analyses to fully evaluate a dataset. I wrote a few small shell scripts to partly automate the process of running jobs on the University of Rochester Center for Integrated Research Computing’s BlueHive cluster. The scripts and details on running them can be found after the jump…
First you need to get a compiled structure binary on BlueHive. You can download the most recent version from the structure homepage and compile it yourself; for convenience, I have also compiled version 2.3.4. Note that if you download this precompiled version, you will have to make it executable on BlueHive by typing:
chmod 755 structure
Once you have downloaded or compiled structure you’ll want to add it to your PATH. By default a directory named “bin” in your home directory will be part of your path. Create a directory called “bin” and then move the structure binary in.
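The steps above might look like the following on the command line. This is only a minimal sketch: it assumes the binary is named structure and sits in your current directory, and the final check simply reports whether ~/bin is actually on your PATH, since that default can differ between systems.

```shell
# Create ~/bin if it does not already exist.
mkdir -p "$HOME/bin"

# Move the binary in (assumes it is called "structure" and is in the current directory).
if [ -f structure ]; then
    mv structure "$HOME/bin/"
fi

# Report whether ~/bin is on the PATH.
case ":$PATH:" in
    *":$HOME/bin:"*) echo "$HOME/bin is on your PATH" ;;
    *) echo "$HOME/bin is not on your PATH; add: export PATH=\"\$HOME/bin:\$PATH\"" ;;
esac
```

If the check reports that ~/bin is missing from your PATH, adding the suggested export line to your shell startup file should fix it for future logins.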
Now, from any directory, type:
structure
This should generate the message below. If it does, you are in business.
----------------------------------------------------
STRUCTURE by Pritchard, Stephens and Donnelly (2000)
and Falush, Stephens and Pritchard (2003)
Code by Pritchard, Falush and Hubisz
Version 2.3.4 (Jul 2012)
----------------------------------------------------
Can't open the file "mainparams".
Exiting the program due to error(s) listed above.
Next you will want to create a script to submit structure jobs to the cluster. My version is called looper.sh. This script submits a batch job to BlueHive requesting 1 processor for 24 hours. If your run takes longer, increase walltime so that it exceeds the time needed to run the analysis. The script passes a few options to structure: -K is the number of clusters to analyze for any particular run, -o sets the name of the output file and -D supplies a random number seed to start the MCMC analysis. The script uses the bash environment variable $RANDOM to generate the seed, which is reported at the top of the output file.
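#!/bin/bash
# Submit to the standard queue, requesting 1 processor on 1 node for 24 hours.
#PBS -q standard
#PBS -l nodes=1:ppn=1
#PBS -l walltime=24:00:00
#PBS -N structure
# Merge stderr into stdout and do not keep copies on the execution host.
#PBS -j oe
#PBS -k n
#PBS -o outfile.k_${k}_run

# Run from the directory the job was submitted from.
cd "$PBS_O_WORKDIR"
# k is passed in with "qsub -v", PBS_ARRAYID is the replicate number, and $RANDOM seeds the MCMC.
structure -K ${k} -o k_${k}_run_$PBS_ARRAYID -D $RANDOM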
Now that we have a basic submission script, we need to give it instructions for exactly how we want our run to be performed. A second script called run_structure.sh accomplishes this. In this script two variables are set: the maximum number of clusters to evaluate and the number of replicates to run for each level of K. The script then runs a loop that calls the looper script once per value of K. In the case below, looper.sh submits 10 runs each for K=1, K=2 and K=3. By editing this script you can run any combination of K and reps that you like.
#!/bin/bash
max_k=3
reps=10
for (( i=1 ; i <= $max_k ; i++ ))
do
    qsub -v k=$i -t 1-$reps looper.sh
done
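Before submitting anything, it can help to preview what the loop will ask the scheduler to do. The sketch below is a dry run: it prints the qsub commands rather than executing them. The function name preview_submissions is hypothetical and is not part of the scripts above.

```shell
# Dry run: print one qsub command per level of K instead of submitting it.
# Arguments: maximum K, number of replicates per K.
preview_submissions() {
    max_k=$1
    reps=$2
    i=1
    while [ "$i" -le "$max_k" ]; do
        echo "qsub -v k=$i -t 1-$reps looper.sh"
        i=$((i + 1))
    done
}

preview_submissions 3 10
```

With the values used above, this prints three lines, one each for K=1, K=2 and K=3, with every line requesting array tasks 1 through 10.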
After you first create the run_structure.sh script, you will need to change its permissions so that it can be run as an executable, using the command:
chmod 755 run_structure.sh
This only needs to be done once; subsequent edits to the script will not change its permissions.
Finally, to run your job, create a directory with the standard structure run files: a project_data file, a mainparams file and an extraparams file. Each of these needs to be formatted with Unix line endings. Copy in the run_structure.sh and looper.sh scripts. To start a run, type:
./run_structure.sh
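Because the run files must have Unix line endings, files prepared on Windows may need converting before the run is launched. Here is a minimal sketch that strips carriage returns with tr. The function name to_unix_endings is hypothetical, and the three file names are the ones listed above.

```shell
# Strip carriage returns (Windows line endings) from each file given as an argument.
to_unix_endings() {
    for f in "$@"; do
        if [ -f "$f" ]; then
            # Write to a temporary file first, then replace the original.
            tr -d '\r' < "$f" > "$f.tmp" && mv "$f.tmp" "$f"
        else
            echo "skipping $f (not found)" >&2
        fi
    done
}

to_unix_endings project_data mainparams extraparams
```

Running this on files that already have Unix line endings leaves their contents unchanged, so it is safe to run more than once.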
If your run was submitted successfully, you should see output similar to that below, with one line for each level of K.
[username@bluehive ]$ ./run_structure.sh
2613562[].bhsn-int.bluehive.crc.private
2613563[].bhsn-int.bluehive.crc.private
2613564[].bhsn-int.bluehive.crc.private
You can then check on the status of your runs by typing:
qinfo
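Once the jobs have finished, a quick check that every replicate produced a results file can save some head-scratching. The sketch below lists any expected files that are missing. It assumes that structure appends "_f" to the name given with -o, so that looper.sh yields files such as k_1_run_1_f; check this against your own output before relying on it. The function name missing_runs is hypothetical.

```shell
# List expected result files that are missing from the current directory.
# Assumes structure appends "_f" to the -o name (verify against your own output).
# Arguments: maximum K, number of replicates per K.
missing_runs() {
    mk=$1
    nreps=$2
    k=1
    while [ "$k" -le "$mk" ]; do
        r=1
        while [ "$r" -le "$nreps" ]; do
            f="k_${k}_run_${r}_f"
            [ -f "$f" ] || echo "$f"
            r=$((r + 1))
        done
        k=$((k + 1))
    done
}

missing_runs 3 10
```

An empty result means every expected file is present; any names printed correspond to replicates that may need to be resubmitted.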