How to extract GenBank accession numbers from files using Python: Part 1

Introduction

If you’re like me, you hate it when you need to extract information from a web page or a PDF (cue lots of cutting and pasting, and frustration at selecting the wrong bits of text).  For example, in my job I often get handed a bunch of academic journal articles in PDF format and am asked to download the sequences from NCBI’s GenBank database.  Or I have a list of “haplotypes” and associated accession numbers in HTML format from using FaBox to collapse my nucleotide alignments.

There is an “easy” way to do it using a simple form of data mining implemented in the Python programming language.  If you can install Python and know the basics of using a terminal on Linux or Mac, you can simplify and automate your data retrieval jobs using this tutorial.

Assumptions

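For the purposes of this tutorial I am going to assume you’re using either a Linux-based machine (e.g. Ubuntu) or OS X (Mac), and that you have Python 2.7 installed.  To read PDF files, the script also calls the ‘pdftotext’ command-line tool, so that needs to be installed too.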
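Before going further, it can save some head-scratching to confirm that both assumptions hold on your machine.  The following is a minimal sketch, not part of the original script: the helper name ‘pdftotext_available’ is made up for illustration, and the import fallback is only there so the same check runs on both Python 2.7 and Python 3.

```python
import sys

try:
    from shutil import which  # available in Python 3.3 and later
except ImportError:
    # Python 2.7 fallback: find_executable does the same PATH lookup
    from distutils.spawn import find_executable as which


def pdftotext_available():
    """Return True if a pdftotext executable can be found on the PATH."""
    return which('pdftotext') is not None


# Report the interpreter version and whether PDF conversion will be possible
print("Python version: %d.%d" % sys.version_info[:2])
print("pdftotext found: %s" % pdftotext_available())
```

If the second line reports False, the script will still work on HTML and plain text files, but PDF input will fail until ‘pdftotext’ is installed.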

The Code

Copy and paste the following script into a text editor and save it as a “.py” file.

# import Python dependencies
import re
import sys
import os
from os import path

# path of the input file, fed to the script via the command line
data_path = sys.argv[1]

# build file names in the directory of the input file, using its name as a template
base_path = path.join(path.dirname(data_path), path.splitext(path.basename(data_path))[0])
output_path = base_path + '_accession_numbers.txt'

# if the file is a PDF, convert it to a temporary plain text file before reading
if data_path.lower().endswith('.pdf'):
    pdf_text_path = base_path + '_pdf.txt'
    os.system("pdftotext '%s' '%s'" % (data_path, pdf_text_path))

    # read the converted PDF text into a variable
    with open(pdf_text_path, 'r') as content_file:
        content = content_file.read()

    # delete the temporary text file now that it has been read
    os.remove(pdf_text_path)

# if the file is in any other format, just open it and read it into a variable
else:
    with open(data_path, 'r') as content_file:
        content = content_file.read()

# use a regular expression to find all instances of words fitting the pattern for GenBank accession numbers e.g. 1 letter followed by 5 numerals or 2 letters followed by 6 numerals
accession_list = re.findall('[A-Z][A-Z]?[0-9][0-9][0-9][0-9][0-9][0-9]?', content)

# write each accession number found to the output file, one per line
with open(output_path, 'w') as output_file:
    for accession in accession_list:
        output_file.write(accession + "\n")

How to use it

To use the code above, simply open up a terminal window and change directory (cd) to the location where the script is saved.  Then type the following and press enter:

python script_name.py file_you_want_to_extract.html

Change ‘script_name.py’ to the file name you saved the code as and ‘file_you_want_to_extract.html’ to the file name (with the full path) you want to extract accession numbers from.

If the file is readable by the script, you should find a new file in the same directory as the input file, with a name ending in ‘_accession_numbers.txt’.  It should contain the accession numbers you were after, one per line.
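If you want to know in advance where the output will land, the naming rule can be sketched on its own.  The helper name ‘output_path_for’ and the example path below are made up for illustration; the logic mirrors how the script builds its output file name from the input file name.

```python
from os import path


def output_path_for(data_path):
    """Return the output file name the script would use for data_path."""
    # Strip the directory and the extension to get the bare file name
    stem = path.splitext(path.basename(data_path))[0]
    # Put the output next to the input, with the fixed suffix appended
    return path.join(path.dirname(data_path), stem + '_accession_numbers.txt')


# Hypothetical input path, used only to show the naming rule
print(output_path_for('/home/user/papers/haplotypes.html'))
```

On Linux or Mac this prints ‘/home/user/papers/haplotypes_accession_numbers.txt’, so the original file extension is dropped and the suffix is added to what is left.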

How it works

Most of the program code is for reading an input file into memory, testing its format (and taking appropriate action), and writing the list of keywords it finds in the text to an output file.

The key line of code in this script is the one containing the regular expression, where ‘accession_list’ is created.

Here we use the ‘re.findall()’ function.  This takes two parameters.  The first is a regular expression describing the pattern of the words we’re looking for.  The second is the variable containing all of the text read from the input file.
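To see the two parameters in isolation, here is a tiny sketch that uses a much simpler pattern than the one in the script.  The sample sentence and the variable names are made up for illustration; the pattern ‘[0-9]+’ just means “one or more numerals in a row”.

```python
import re

# Made-up text standing in for the contents of an input file
text = "Lanes 3 and 12 were rerun in 2014."

# First parameter: the pattern.  Second parameter: the text to search.
numbers = re.findall('[0-9]+', text)

print(numbers)  # ['3', '12', '2014']
```

Note that ‘re.findall()’ returns a list of strings in the order they appear in the text, which is why the script can loop over the result and write one item per line.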

In this case, the GenBank accession numbers we’re after are written in one of two ways:

  1. 1 upper-case letter followed by 5 numerals e.g. A12345
  2. 2 upper-case letters followed by 6 numerals e.g. AA123456

The regular expression which expresses this is as follows:

‘[A-Z][A-Z]?[0-9][0-9][0-9][0-9][0-9][0-9]?’

You can see here that an upper-case letter is always expected as the first character, followed by an optional second upper-case letter.  Next we always expect to find five numerals, followed by an optional sixth numeral.  It is the ‘?’ after a character that makes it optional.
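One side effect of making the second letter and the sixth numeral optional independently is that the pattern also accepts mixed forms, such as 1 letter with 6 numerals, and it will happily match inside a longer word.  If that causes false hits in your files, a stricter pattern is one option.  The sketch below is my own variation rather than part of the script above: it uses ‘{5}’ and ‘{6}’ as shorthand for repeated numerals, ‘|’ to offer the two forms as alternatives, and ‘\b’ to require a word boundary at each end.  The sample text is made up.

```python
import re

# The pattern used in the script
loose = '[A-Z][A-Z]?[0-9][0-9][0-9][0-9][0-9][0-9]?'

# A stricter variation: exactly 1 letter + 5 numerals, or 2 letters + 6 numerals,
# and only when the match stands alone as a whole word
strict = r'\b(?:[A-Z][0-9]{5}|[A-Z]{2}[0-9]{6})\b'

# Made-up sample: two well-formed accessions followed by three look-alikes
sample = "AB123456 U12345 A123456 AB12345 XYZ1234567"

loose_hits = re.findall(loose, sample)
strict_hits = re.findall(strict, sample)

print(loose_hits)   # ['AB123456', 'U12345', 'A123456', 'AB12345', 'YZ123456']
print(strict_hits)  # ['AB123456', 'U12345']
```

The trade-off is that the stricter pattern will skip anything that is not one of the two forms listed above, so it is worth checking a few files by eye before relying on it.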

Using this regular expression, the ‘re.findall()’ function finds all the instances in the supplied text where the regular expression matches a string of characters.  These can be printed to the terminal, or in our case we populate what’s called a list variable to write to an output file.
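As a small illustration of working with that list, the sketch below prints the matches to the terminal instead of writing them to a file.  It also drops repeats while keeping the original order, on the assumption that a paper may mention the same accession number more than once; the script above does not do this, and the sample text is made up.

```python
import re

# Made-up text in which one accession number appears twice
content = "AB123456 was compared with U12345; AB123456 was longer."

accession_list = re.findall('[A-Z][A-Z]?[0-9][0-9][0-9][0-9][0-9][0-9]?', content)

# Keep only the first occurrence of each accession number, preserving order
seen = set()
unique_accessions = []
for accession in accession_list:
    if accession not in seen:
        seen.add(accession)
        unique_accessions.append(accession)

# Print one accession number per line
for accession in unique_accessions:
    print(accession)
```

Here ‘accession_list’ holds three items and ‘unique_accessions’ holds two, so only AB123456 and U12345 are printed.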

Summary

Regular expressions in Python can be used to pinpoint and gather all instances of any string of characters you can think of in a file, without the need to read it yourself and rely upon manual cutting and pasting.  This is a very powerful tool if the things you want to extract (e.g. GenBank accession numbers or postcodes) always follow the same set of rules.  The code above can be easily edited to change the regular expression you’re looking for, and even used in a software ‘pipeline’ to be run iteratively (something which I will cover in part 2).  Using the above example will allow you to extract accession numbers for use in genomics from papers in PDF format, HTML tables, plain text and more.
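As a rough idea of what “editing the regular expression” can look like, the sketch below wraps the search in a function that takes the pattern as a parameter.  The function name ‘extract_matches’, the sample sentence and the five-numeral pattern are all made up for illustration; the second pattern is only a simplified stand-in for a postcode-style rule, not a complete one.

```python
import re

# The accession number pattern from the script above
ACCESSION_PATTERN = '[A-Z][A-Z]?[0-9][0-9][0-9][0-9][0-9][0-9]?'


def extract_matches(text, pattern=ACCESSION_PATTERN):
    """Return every piece of text that matches pattern, in order of appearance."""
    return re.findall(pattern, text)


# Made-up sentence containing one accession number and one five-numeral code
text = "Voucher AB123456 was collected near ZIP code 90210."

# Default pattern: accession numbers
print(extract_matches(text))                    # ['AB123456']

# A different rule: exactly five numerals standing alone as a word
print(extract_matches(text, r'\b[0-9]{5}\b'))   # ['90210']
```

Only the pattern changes between the two calls, which is the part of the script you would edit to go after a different kind of text.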