
Bioinformatics Data Skills (2015)

Part II. Prerequisites: Essential Skills for Getting Started with a Bioinformatics Project

Chapter 6. Bioinformatics Data

Thus far, we’ve covered many of the preliminaries to get started in bioinformatics: organizing a project directory, intermediate Unix, working with remote machines, and using version control. However, we’ve ignored an important component of a new bioinformatics project: data.

Data is a requisite of any bioinformatics project. We further our understanding of complex biological systems by refining a large amount of data to a point where we can extract meaning from it. Unfortunately, many tasks that are simple with small or medium-sized datasets are a challenge with the large and complex datasets common in genomics. These challenges include:

Retrieving data

Whether downloading large sequencing datasets or accessing a web application hundreds of times to download specific files, retrieving data in bioinformatics can require special tools and skills.

Ensuring data integrity

Transferring large datasets across networks creates more opportunities for data corruption, which can later lead to incorrect analyses. Consequently, we need to ensure data integrity with tools before continuing with analysis. The same tools can also be used to verify we’re using the correct version of data in an analysis.

Compression

The data we work with in bioinformatics is large enough that it often needs to be compressed. Consequently, working with compressed data is an essential skill in bioinformatics.

Retrieving Bioinformatics Data

Suppose you’ve just been told the sequencing for your project has been completed: you have six lanes of Illumina data to download from your sequencing center. Downloading this amount of data through your web browser is not feasible: web browsers are not designed to download such large datasets. Additionally, you’d need to download this sequencing data to your server, not the local workstation where you browse the Internet. To do this, you’d need to SSH into your data-crunching server and download your data directly to this machine using command-line tools. We’ll take a look at some of these tools in this section.

Downloading Data with wget and curl

Two common command-line programs for downloading data from the Web are wget and curl. Depending on your system, these may not already be installed; you’ll have to install them with a package manager (e.g., Homebrew or apt-get; see the example below). While curl and wget are similar in basic functionality, their relative strengths are different enough that I use both frequently.
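For example, installation might look like the following (exact commands and package names vary by system):

$ brew install wget # OS X, using Homebrew

$ sudo apt-get install wget curl # Debian or Ubuntu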

wget

wget is useful for quickly downloading a file from the command line — for example, human chromosome 22 from the GRCh37 (also known as hg19) assembly version:

$ wget http://hgdownload.soe.ucsc.edu/goldenPath/hg19/chromosomes/chr22.fa.gz

--2013-06-30 00:15:45-- http://[...]/goldenPath/hg19/chromosomes/chr22.fa.gz

Resolving hgdownload.soe.ucsc.edu... 128.114.119.163

Connecting to hgdownload.soe.ucsc.edu|128.114.119.163|:80... connected.

HTTP request sent, awaiting response... 200 OK

Length: 11327826 (11M) [application/x-gzip]

Saving to: ‘chr22.fa.gz’

17% [======> ] 1,989,172 234KB/s eta 66s

wget downloads this file to your current directory and provides a useful progress indicator. Notice that the link to chromosome 22 begins with “http” (short for HyperText Transfer Protocol). wget can also handle FTP links (which start with “ftp,” short for File Transfer Protocol). In general, FTP is preferable to HTTP for large files (and is often recommended by websites like the UCSC Genome Browser).

Because UCSC generously provides the human reference genome publicly, we don’t need any authentication to gain access to this file. However, if you’re downloading data from a Laboratory Information Management System (LIMS), you may need to authenticate with a username and password first. For simple HTTP or FTP authentication, you can use wget’s --user= and --ask-password options (see the example below). Some sites use more elaborate authentication procedures; in these cases, you’ll need to contact the site administrator.
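For example, a simple authenticated download might look like the following (the username, hostname, and path here are hypothetical):

$ wget --user=vinceb --ask-password http://lims.sequencingcenter.com/seqs/sample_A_03.fastq.gz

wget prompts for the password interactively, which keeps it out of your shell’s history.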

One of wget’s strengths is that it can download data recursively. When run with the recursive option (--recursive or -r), wget will also follow and download the pages linked to, and even follow and download links on these pages, and so forth. By default (to avoid downloading the entire Internet), wget limits itself to only follow five links deep (but this can be customized using the option --level or -l).

Recursive downloading can be useful for downloading all files of a certain type from a page. For example, if another lab has a web page containing many GTF files we wish to download, we could use:

$ wget --accept "*.gtf" --no-directories --recursive --no-parent \

http://genomics.someuniversity.edu/labsite/annotation.html

But beware! wget’s recursive downloading can be quite aggressive. If not constrained, wget will download everything it can reach within the maximum depth set by --level. In the preceding example, we limited wget in two ways: with --no-parent to prevent wget from downloading pages higher in the directory structure, and with --accept "*.gtf", which only allows wget to download filenames matching this pattern.

Exercise caution when using wget’s recursive option; it can utilize a lot of network resources and overload the remote server. In some cases, the remote host may block your IP address if you’re downloading too much too quickly. If you plan on downloading a lot of data from a remote host, it’s best to inquire with the website’s sysadmin about recommended download limits so your IP isn’t blocked. wget’s --limit-rate option can be used to limit how quickly wget downloads files.
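For example, to cap the download bandwidth at roughly 500 kilobytes per second (an arbitrary value; ask the site’s sysadmin what’s appropriate):

$ wget --limit-rate=500k http://hgdownload.soe.ucsc.edu/goldenPath/hg19/chromosomes/chr22.fa.gz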

wget is an incredibly powerful tool; the preceding examples have only scraped the surface of its capabilities. See Table 6-1 for some commonly used options, or man wget for an exhaustive list.

Table 6-1. Useful wget options

| Option | Values | Use |
| --- | --- | --- |
| -A, --accept | Either a suffix like “.fastq” or a pattern with *, ?, or [ and ]; optionally a comma-separated list | Only download files matching these criteria. |
| -R, --reject | Same as with --accept | Don’t download files matching this; for example, to download all files on a page except Excel files, use --reject ".xls". |
| -nd, --no-directories | No value | Don’t place locally downloaded files in the same directory hierarchy as the remote files. |
| -r, --recursive | No value | Follow and download links on a page, to a maximum depth of five by default. |
| -np, --no-parent | No value | Don’t move above the parent directory. |
| --limit-rate | Number of bytes per second to allow (suffixes like k and m also work) | Throttle download bandwidth. |
| --user=user | FTP or HTTP username | Username for HTTP or FTP authentication. |
| --ask-password | No value | Prompt for a password for HTTP or FTP authentication; --password= could also be used, but then your password is in your shell’s history. |
| -O | Output filename | Download the file to the specified filename; useful if the link doesn’t have an informative name (e.g., http://lims.sequencingcenter.com/seqs.html?id=sample_A_03). |

curl

curl serves a slightly different purpose than wget. wget is great for downloading files via HTTP or FTP and scraping data from a web page using its recursive option. curl behaves similarly, although by default it writes the file to standard output. To download chromosome 22 as we did with wget, we’d use:

$ curl http://[...]/goldenPath/hg19/chromosomes/chr22.fa.gz > chr22.fa.gz

% Total % Received % Xferd Average Speed Time Time Time Current

Dload Upload Total Spent Left Speed

14 10.8M 14 1593k 0 0 531k 0 0:00:20 0:00:02 0:00:18 646k

Note that I’ve had to truncate the URL so this example fits within a page; the URL is the same as we used with wget earlier.

If you prefer not to redirect curl’s output, use -o <filename> to write the results to <filename>, or -O to save the file under the same name it has on the remote host.
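For example, either of the following downloads chromosome 22 without shell redirection (URL truncated as before); the first keeps the remote filename, while the second names the output file explicitly:

$ curl -O http://[...]/goldenPath/hg19/chromosomes/chr22.fa.gz

$ curl -o chr22.fa.gz http://[...]/goldenPath/hg19/chromosomes/chr22.fa.gz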

curl has the advantage that it can transfer files using more protocols than wget, including SFTP (secure FTP) and SCP (secure copy). One especially nice feature of curl is that it can follow page redirects if the -L/--location option is enabled. With this enabled, curl will download the ultimate page the link redirects to, not the redirect page itself (see the example below). Finally, curl is also available as a library (libcurl), meaning that in addition to the command-line curl program, its functionality is wrapped by software libraries like RCurl and pycurl.
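A minimal sketch of following a redirect to download a file (the URL here is entirely hypothetical):

$ curl -L -o chr22.fa.gz http://links.example.org/hg19/chr22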

Rsync and Secure Copy (scp)

wget and curl are appropriate for quickly downloading files from the command line, but are not optimal for some heavier-duty tasks. For example, suppose a colleague needs all large sequencing datasets in your project directory that are ignored by Git (e.g., in your .gitignore). A better tool for synchronizing these entire directories across a network is Rsync.

Rsync is a superior option for these types of tasks for a few reasons. First, Rsync is often faster because it only sends the difference between file versions (when a copy already exists or partially exists) and it can compress files during transfers. Second, Rsync has an archive option that preserves links, modification timestamps, permissions, ownership, and other file attributes. This makes Rsync an excellent choice for network backups of entire directories. Rsync also has numerous features and options to handle different backup scenarios, such as what to do if a file exists on the remote host.

rsync’s basic syntax is rsync source destination, where source is the source of the files or directories you’d like to copy, and destination is the destination you’d like to copy these files to. Either source or destination can be a remote host specified in the format user@host:/path/to/directory/.

Let’s look at an example of how we can use rsync to copy over an entire directory to another machine. Suppose you’d like to copy all of your project’s data in zea_mays/data to your colleague’s directory /home/deborah/zea_mays/data on the host 192.168.237.42. The most common combination of rsync options used to copy an entire directory is -avz. The option -a enables rsync’s archive mode, -z enables file transfer compression, and -v makes rsync’s progress more verbose so you can see what’s being transferred. Because we’ll be connecting to the remote host through SSH, we also need to use -e ssh. Our directory copying command would look as follows:

$ rsync -avz -e ssh zea_mays/data/ vinceb@[...]:/home/deborah/zea_mays/data

building file list ... done

zmaysA_R1.fastq

zmaysA_R2.fastq

zmaysB_R1.fastq

zmaysB_R2.fastq

zmaysC_R1.fastq

zmaysC_R2.fastq

sent 2861400 bytes received 42 bytes 107978.94 bytes/sec

total size is 8806085 speedup is 3.08

One subtle yet important behavior of rsync is that trailing slashes (e.g., data/ versus data) are meaningful when specifying paths in rsync. A trailing slash in the source path means copy the contents of the source directory, whereas no trailing slash means copy the entire directory itself. Because we’d like to copy all contents of zea_mays/data/ to /home/deborah/zea_mays/data in our example, we use a trailing slash. If the data/ directory didn’t already exist on the remote destination host, we’d want to copy it and its contents by using zea_mays/data (i.e., omitting the trailing slash).
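To make the trailing slash distinction concrete, the first command below copies the contents of data/ into the remote data/ directory, while the second copies the data/ directory itself into /home/deborah/zea_mays (same user and host as before; both leave the same files on the remote side):

$ rsync -avz -e ssh zea_mays/data/ vinceb@192.168.237.42:/home/deborah/zea_mays/data

$ rsync -avz -e ssh zea_mays/data vinceb@192.168.237.42:/home/deborah/zea_mays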

Because rsync only transmits files if they don’t exist or they’ve changed, you can (and should) run rsync again after your files have transferred. This operates as a simple check to ensure everything is synchronized between the two directories. It’s also a good idea to check the exit status of rsync when calling it in scripts; rsync will exit with a nonzero status if it runs into problems transferring files. Lastly, rsync can use host aliases specified in an SSH config file (see the first Tip in “Connecting to Remote Machines with SSH”). You can omit -e ssh if you connect to a host through an SSH host alias.
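In a script, the exit status check mentioned above might look like this minimal sketch (same source, user, and host as the earlier example):

if ! rsync -avz -e ssh zea_mays/data/ vinceb@192.168.237.42:/home/deborah/zea_mays/data
then
    # rsync returned a nonzero exit status; stop before running downstream steps
    echo "error: rsync transfer failed" >&2
    exit 1
fi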

Occasionally, we just need to quickly copy a single file over SSH, for tasks where Unix’s cp would be sufficient but needs to work over an SSH connection. rsync would work, but it’s a bit overkill. Secure copy (scp) is perfect for this purpose. Secure copy works just like cp, except we need to specify both host and path (using the same user@host:/path/to/file notation as rsync). For example, we could transfer a single GTF file to 192.168.237.42:/home/deborah/zea_mays/data/ using:

$ scp Zea_mays.AGPv3.20.gtf 192.168.237.42:/home/deborah/zea_mays/data/

Zea_mays.AGPv3.20.gtf 100% 55 0.1KB/s 00:00

Data Integrity

Data we download into our project directory is the starting point of all future analyses and conclusions. Although it may seem improbable, the risk of data corruption during transfers is a concern when transferring large datasets. These large files take a long time to transfer, which translates to more opportunities for network connections to drop and bits to be lost. In addition to verifying your transfer finished without error, it’s also important to explicitly check the transferred data’s integrity with checksums. Checksums are very compressed summaries of data, computed in a way that even if just one bit of the data is changed, the checksum will be different.

Data integrity checks are also helpful in keeping track of data versions. In collaborative projects, our analyses may depend on our colleagues’ intermediate results. When these intermediate results change, all downstream analyses that depend on these results need to be rerun. With many intermediate files, it’s not always clear which data has changed and which steps need to be rerun. The checksums would differ if the data changed even the tiniest bit, so we can use them to calculate the version of the data. Checksums also facilitate reproducibility, as we can link a particular analysis and set of results to an exact version of data summarized by the data’s checksum value.

SHA and MD5 Checksums

The two most common checksum algorithms for ensuring data integrity are MD5 and SHA-1. We’ve already encountered SHA-1 in Chapter 4, as this is what Git uses for its commit IDs. MD5 is an older checksum algorithm, but one that is still commonly used. Both MD5 and SHA-1 behave similarly, but SHA-1 is newer and generally preferred. However, MD5 is more common; it’s likely to be what you encounter if a server has precomputed checksums on a set of files.

Let’s get acquainted with checksums using SHA-1. We can pass arbitrary strings to the program shasum (on some systems, it’s sha1sum) through standard in:

$ echo "bioinformatics is fun" | shasum

f9b70d0d1b0a55263f1b012adab6abf572e3030b -

$ echo "bioinformatic is fun" | shasum

e7f33eedcfdc9aef8a9b4fec07e58f0cf292aa67 -

The long string of numbers and letters is the SHA-1 checksum. Checksums are reported in hexadecimal format, where each digit can be one of 16 characters: the digits 0 through 9 and the letters a, b, c, d, e, and f. The trailing dash indicates this is the SHA-1 checksum of input from standard in. Note that when we omitted the “s” in “bioinformatics” and calculated the SHA-1 checksum, the checksum value changed entirely. This is the strength of using checksums: they change when even the tiniest part of the input changes. Checksums are completely deterministic, meaning that regardless of the time the checksum is calculated or the system used, checksum values will only differ if the input differs.

We can also use checksums with file input (note that the content of Csyrichta_TAGGACT_L008_R1_001.fastq is fake example data):

$ shasum Csyrichta_TAGGACT_L008_R1_001.fastq

fea7d7a582cdfb64915d486ca39da9ebf7ef1d83 Csyrichta_TAGGACT_L008_R1_001.fastq

If our sequencing center says the checksum of the Csyrichta_TAGGACT_L008_R1_001.fastq.gz sequencing file is “069bf5894783db241e26f4e44201bd12f2d5aa42” and our local SHA checksum is “fea7d7a582cdfb64915d486ca39da9ebf7ef1d83,” we know our file differs somehow from the original version.

When downloading many files, it can get rather tedious to check each checksum individually. The program shasum has a convenient solution — it can create and validate against a file containing the checksums of files. We can create a SHA-1 checksum file for all FASTQ files in the data/ directory as follows:

$ shasum data/*fastq > fastq_checksums.sha

$ cat fastq_checksums.sha

524d9a057c51b1[...]d8b1cbe2eaf92c96a9 data/Csyrichta_TAGGACT_L008_R1_001.fastq

d2940f444f00c7[...]4f9c9314ab7e1a1b16 data/Csyrichta_TAGGACT_L008_R1_002.fastq

623a4ca571d572[...]1ec51b9ecd53d3aef6 data/Csyrichta_TAGGACT_L008_R1_003.fastq

f0b3a4302daf7a[...]7bf1628dfcb07535bb data/Csyrichta_TAGGACT_L008_R1_004.fastq

53e2410863c36a[...]4c4c219966dd9a2fe5 data/Csyrichta_TAGGACT_L008_R1_005.fastq

e4d0ccf541e90c[...]5db75a3bef8c88ede7 data/Csyrichta_TAGGACT_L008_R1_006.fastq

Then, we can use shasum’s check option (-c) to validate that these files match the original versions:

$ shasum -c fastq_checksums.sha

data/Csyrichta_TAGGACT_L008_R1_001.fastq: OK

data/Csyrichta_TAGGACT_L008_R1_002.fastq: OK

data/Csyrichta_TAGGACT_L008_R1_003.fastq: OK

data/Csyrichta_TAGGACT_L008_R1_004.fastq: OK

data/Csyrichta_TAGGACT_L008_R1_005.fastq: OK

data/Csyrichta_TAGGACT_L008_R1_006.fastq: FAILED

shasum: WARNING: 1 computed checksum did NOT match

In the event that the checksums of a file disagree, shasum will show you which file failed validation and exit with a nonzero error status.

The program md5sum (or md5 on OS X) calculates MD5 hashes and is similar in operation to shasum. However, note that on OS X, the md5 command doesn’t have the -c option, so you’ll need to install the GNU version for this option. Also, some servers use an antiquated checksum implementation such as sum or cksum. These older command-line checksum programs are used in much the same way as shasum and md5sum.
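For example, md5sum mirrors shasum’s interface for creating and validating a checksum file (the .md5 filename is arbitrary):

$ md5sum data/*fastq > fastq_checksums.md5

$ md5sum -c fastq_checksums.md5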

Finally, you may be curious how all files can be summarized by a 40-character-long SHA-1 checksum. They can’t: there are only 16^40 possible different checksums. However, 16^40 is a huge number, so the probability of a checksum collision is very, very small. For the purposes of checking data integrity, the risk of a collision is negligible.

Looking at Differences Between Data

While checksums are a great method to check if files are different, they don’t tell us how files differ. One approach to this is to compute the diff between two files using the Unix tool diff. Unix’s diff works line by line, and outputs blocks (called hunks) that differ between files (resembling Git’s git diff command we saw in Chapter 4).

Suppose you notice a collaborator was working with a different version of a file than the one you’re using. Her version is gene-2.bed, and your version is gene-1.bed (these files are on GitHub if you’d like to follow along). Because downstream results depend on this dataset, you want to check if the files are indeed different. After comparing the SHA-1 checksums, you find the files aren’t identical. Before rerunning your analysis using your collaborator’s version, you’d like to check whether the files differ significantly. We can do this by computing the diff between gene-1.bed and gene-2.bed:

$ diff -u gene-1.bed gene-2.bed

--- gene-1.bed 2014-02-22 12:53:14.000000000 -0800 (1)

+++ gene-2.bed 2015-03-10 01:55:01.000000000 -0700

@@ -1,22 +1,19 @@ (2)

1 6206197 6206270 GENE00000025907

1 6223599 6223745 GENE00000025907 (3)

1 6227940 6228049 GENE00000025907

+1 6222341 6228319 GENE00000025907 (4)

1 6229959 6230073 GENE00000025907

-1 6230003 6230005 GENE00000025907 (5)

1 6233961 6234087 GENE00000025907

1 6234229 6234311 GENE00000025907

1 6206227 6206270 GENE00000025907

1 6227940 6228049 GENE00000025907

1 6229959 6230073 GENE00000025907

-1 6230003 6230073 GENE00000025907 (6)

+1 6230133 6230191 GENE00000025907

1 6233961 6234087 GENE00000025907

1 6234229 6234399 GENE00000025907

1 6238262 6238384 GENE00000025907

-1 6214645 6214957 GENE00000025907

1 6227940 6228049 GENE00000025907

1 6229959 6230073 GENE00000025907

-1 6230003 6230073 GENE00000025907

1 6233961 6234087 GENE00000025907

1 6234229 6234399 GENE00000025907

-1 6238262 6238464 GENE00000025907

1 6239952 6240378 GENE00000025907

The option -u tells diff to output in unified diff format, which is a format nearly identical to the one used by git diff. I’ve chosen to use unified diffs rather than diff’s default diff format because unified diffs provide more context.

Unified diffs are broken down into hunks that differ between the two files. Let’s step through the key parts of this format, following the numbered callouts marked in the output above:

(1) These two lines are the header of the unified diff. The original file gene-1.bed is prefixed by ---, and the modified file gene-2.bed is prefixed by +++. The date and time in these two lines are the modification times of these files.

(2) This line indicates the start of a changed hunk. The pairs of integers between @@ and @@ indicate where the hunk begins, and how long it is, in the original file (-1,22) and the modified file (+1,19), respectively.

(3) Lines in the diff that begin with a space are present in both files; they are unchanged and included as context.

(4) Lines in the diff that begin with a + indicate a line has been added to the modified file.

(5) Similarly, lines that begin with a - indicate lines removed in the modified file.

(6) An adjacent line deletion and line addition indicate that this line was changed in the modified file.

Diff files appear very cryptic at first, but you’ll grow familiar with them over time. diff’s output can also be redirected to a file, which creates a patch file. Patch files act as instructions on how to update a plain-text file, making the changes contained in the diff. The Unix tool patch applies the changes described in a patch file to the target file. Patches are used more commonly in software development than bioinformatics, so we won’t cover them in detail. Lastly, it’s important to note that diffs can be computationally expensive to compute on large files, so be cautious when running diff on large datasets.
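As a brief illustration of the patch workflow mentioned above, we could save the diff of the two BED files and then apply it to bring gene-1.bed up to date with gene-2.bed (the patch filename is arbitrary):

$ diff -u gene-1.bed gene-2.bed > gene-1_to_2.patch

$ patch gene-1.bed < gene-1_to_2.patch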

Compressing Data and Working with Compressed Data

Data compression, the process of condensing data so that it takes up less space (on disk drives, in memory, or across network transfers), is an indispensable technology in modern bioinformatics. For example, sequences from a recent Illumina HiSeq run when compressed with Gzip take up 21,408,674,240 bytes, which is a bit under 20 gigabytes. Uncompressed, this file is a whopping 63,203,414,514 bytes (around 58 gigabytes). This FASTQ file has 150 million 200bp reads, which is 10x coverage of the human genome, 190x coverage of the Arabidopsis genome, or a measly 2x coverage of the hexaploid wheat genome. The compression ratio (uncompressed size/compressed size) of this data is approximately 2.95, which translates to a significant space saving of about 66%. Your own bioinformatics projects will likely contain much more data, especially as sequencing costs continue to drop and it’s possible to sequence genomes to higher depth, include more biological replicates or time points in expression studies, or sequence more individuals in genotyping studies.

For the most part, data can remain compressed on the disk throughout processing and analyses. Most well-written bioinformatics tools can work natively with compressed data as input, without requiring us to decompress it to disk first. Using pipes and redirection (covered in Chapter 3), we can stream compressed data and write compressed files directly to the disk. Additionally, common Unix tools like cat, grep, and less all have variants that work with compressed data, and Python’s gzip module allows us to read and write compressed data from within Python. So while working with large datasets in bioinformatics can be challenging, using the compression tools in Unix and software libraries make our lives much easier.

gzip

The two most common compression systems used on Unix are gzip and bzip2. Both have their advantages: gzip compresses and decompresses data faster than bzip2, but bzip2 has a higher compression ratio (the previously mentioned FASTQ file is only about 16 GB when compressed with bzip2). Generally, gzip is used in bioinformatics to compress most sizable files, while bzip2 is more common for long-term data archiving. We’ll focus primarily on gzip, but bzip2’s tools behave very similarly to gzip.
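For example, bzip2 and bunzip2 compress and decompress files in place much as gzip and gunzip do (reusing the imaginary in.fastq file from the examples below):

$ bzip2 in.fastq # produces in.fastq.bz2

$ bunzip2 in.fastq.bz2 # restores in.fastq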

The command-line tool gzip allows you to compress files in a few different ways. First, gzip can compress results from standard input. This is quite useful, as we can compress results directly from another bioinformatics program’s standard output. For example, suppose we have a program that removes low-quality bases from FASTQ files called trimmer (this is an imaginary program). Our trimmer program can handle gzipped input files natively, but writes uncompressed trimmed FASTQ results to standard output. Using gzip, we can compress trimmer’s output before it’s written to disk:

$ trimmer in.fastq.gz | gzip > out.fastq.gz

gzip takes input from standard in, compresses it, and writes this compressed output to standard out.

gzip also can compress files on disk in place. If our in.fastq.gz file weren’t compressed, we could compress it as follows:

$ ls

in.fastq

$ gzip in.fastq

$ ls

in.fastq.gz

gzip will compress this file in place, replacing the original uncompressed version with the compressed file (appending the extension .gz to the original filename). Similarly, we can decompress files in place with the command gunzip:

$ gunzip in.fastq.gz

$ ls

in.fastq

Note that this replaces our in.fastq.gz file with the decompressed version (removing the .gz suffix, too). Both gzip and gunzip can also output their results to standard out (rather than changing files in place). This can be enabled using the -c option:

$ gzip -c in.fastq > in.fastq.gz

$ gunzip -c in.fastq.gz > duplicate_in.fastq

A nice feature of the gzip compression algorithm is that you can concatenate gzip compressed output directly to an existing gzip file. For example, if we wanted to compress the in2.fastq file and append it to our compressed in.fastq.gz file, we wouldn’t have to decompress in.fastq.gz first, concatenate the two files, and then compress the concatenated file. Instead, we can do the following:

$ ls

in.fastq.gz in2.fastq

$ gzip -c in2.fastq >> in.fastq.gz

Importantly, note that the redirection operator we use is >>; had we used >, we would have overwritten in.fastq.gz with the compressed version of in2.fastq (rather than appending to it). Always exercise caution when using redirection, and make sure you’re using the appropriate operator (and keep file backups!). You may get a slightly better compression ratio by compressing everything together (e.g., with cat in.fastq in2.fastq | gzip > in.fastq.gz), but the convenience of appending to an existing gzipped file is useful. Also, note that gzip does not separate these compressed files: files compressed together are concatenated. If you need to compress multiple separate files into a single archive, use the tar utility (see the examples section of man tar for details).
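For instance, a minimal sketch of bundling two FASTQ files into a single gzip-compressed archive (the archive name is arbitrary and the filenames are just illustrative):

$ tar -czvf sequences.tar.gz in.fastq in2.fastq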

Working with Gzipped Compressed Files

Perhaps the greatest advantage of gzip (and bzip2) is that many common Unix and bioinformatics tools can work directly with compressed files. For example, we can search compressed files using grep’s analog for gzipped files, zgrep. Likewise, cat has zcat (on some systems like OS X, this is gzcat), diff has zdiff, and less has zless. If programs cannot handle compressed input, you can use zcat and pipe output directly to the standard input of another program.
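For example, head cannot decompress input itself, but we can pipe zcat’s output to it to peek at the first two FASTQ records (four lines each):

$ zcat Csyrichta_TAGGACT_L008_R1_001.fastq.gz | head -n 8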

These programs that handle compressed input behave exactly like their standard counterparts. For example, all options available in grep are available in zgrep:

$ zgrep --color -i -n "AGATAGAT" Csyrichta_TAGGACT_L008_R1_001.fastq.gz

2706: ACTTCGGAGAGCCCATATATACACACTAAGATAGATAGCGTTAGCTAATGTAGATAGATT

There can be a slight performance cost in working with gzipped files, as your CPU must decompress input first. Usually, the convenience of z-tools like zgrep, zless, and zcat and the saved disk space outweigh any potential performance hits.

Case Study: Reproducibly Downloading Data

Downloading data reproducibly can be deceptively complex. We usually download genomic resources like sequence and annotation files from remote servers over the Internet, which may change in the future. Furthermore, new versions of sequence and annotation data may be released, so it’s imperative that we document everything about how data was acquired for full reproducibility. As a demonstration of this, let’s step through a case study. We’ll download a few genomic and sequence resources for mouse (Mus musculus) and document how we acquired them.

For this example, we’ll download the GRCm38 mouse reference genome and accompanying annotation. Note that this case study involves downloading large files, so you may not want to follow along with these examples. The mouse, human (Homo sapiens), and zebrafish (Danio rerio) genome releases are coordinated through the Genome Reference Consortium. The “GRC” prefix in GRCm38 refers to the Genome Reference Consortium. We can download GRCm38 from Ensembl (a member of the consortium) using wget. For this and other examples in this section, I’ve had to truncate the URLs so they fit within a book’s page width; see this chapter’s README.md on GitHub for the full links for copying and pasting if you’re following along.

$ wget ftp://ftp.ensembl.org/[...]/Mus_musculus.GRCm38.74.dna.toplevel.fa.gz

Ensembl’s website provides links to reference genomes, annotation, variation data, and other useful files for many organisms. This FTP link comes from navigating to http://www.ensembl.org, clicking the mouse project page, and then clicking the “Download DNA sequence” link. If we were to document how we downloaded this file, our Markdown README.md might include something like:

Mouse (*Mus musculus*) reference genome version GRCm38 (Ensembl

release 74) was downloaded on Sat Feb 22 21:24:42 PST 2014, using:

wget ftp://ftp.ensembl.org/[...]/Mus_musculus.GRCm38.74.dna.toplevel.fa.gz

We might want to look at the chromosomes, scaffolds, and contigs this file contains as a sanity check. This file is a gzipped FASTA file, so we can take a quick peek at all sequence headers by grepping for the regular expression "^>", which matches all lines beginning with > (a FASTA header). We can use the zgrep program to extract the FASTA headers from this gzipped file:

$ zgrep "^>" Mus_musculus.GRCm38.74.dna.toplevel.fa.gz | less

Ensembl also provides a checksum file in the parent directory called CHECKSUMS. This checksum file contains checksums calculated using the older Unix tool sum. We can compare our checksum values with those in CHECKSUMS using the sum program:

$ wget ftp://ftp.ensembl.org/pub/release-74/fasta/mus_musculus/dna/CHECKSUMS

$ sum Mus_musculus.GRCm38.74.dna.toplevel.fa.gz

53504 793314

The checksum 53504 agrees with the entry for Mus_musculus.GRCm38.74.dna.toplevel.fa.gz in the CHECKSUMS file. I also like to include the SHA-1 sums of all important data in my data README.md file, so future collaborators can verify their data files are exactly the same as those I used. Let’s calculate the SHA-1 sum using shasum:

$ shasum Mus_musculus.GRCm38.74.dna.toplevel.fa.gz

01c868e22a981[...]c2154c20ae7899c5f Mus_musculus.GRCm38.74.dna.toplevel.fa.gz

Then, we can copy and paste this SHA-1 sum into our README.md. Next, we can download an accompanying GTF from Ensembl and the CHECKSUMS file for this directory:

$ wget ftp://ftp.ensembl.org/[...]/Mus_musculus.GRCm38.74.gtf.gz

$ wget ftp://ftp.ensembl.org/[...]/CHECKSUMS

Again, let’s ensure that our checksums match those in the CHECKSUMS file and run shasum on this file for our own documentation:

$ sum Mus_musculus.GRCm38.74.gtf.gz

00985 15074

$ shasum Mus_musculus.GRCm38.74.gtf.gz

cf5bb5f8bda2803410bb04b708bff59cb575e379 Mus_musculus.GRCm38.74.gtf.gz

And again, we copy the SHA-1 into our README.md. So far, our README.md might look as follows:

## Genome and Annotation Data

Mouse (*Mus musculus*) reference genome version GRCm38 (Ensembl

release 74) was downloaded on Sat Feb 22 21:24:42 PST 2014, using:

wget ftp://ftp.ensembl.org/[...]/Mus_musculus.GRCm38.74.dna.toplevel.fa.gz

Gene annotation data (also Ensembl release 74) was downloaded from Ensembl on

Sat Feb 22 23:30:27 PST 2014, using:

wget ftp://ftp.ensembl.org/[...]/Mus_musculus.GRCm38.74.gtf.gz

## SHA-1 Sums

- `Mus_musculus.GRCm38.74.dna.toplevel.fa.gz`: 01c868e22a9815c[...]c2154c20ae7899c5f

- `Mus_musculus.GRCm38.74.gtf.gz`: cf5bb5f8bda2803[...]708bff59cb575e379

Although this isn’t a lot of documentation, it is infinitely better than not documenting how data was acquired. As this example demonstrates, it takes very little effort to properly track the data that enters your project, and thereby ensure reproducibility. The most important part of documenting your work is to be consistent and make it a habit.