How to convert SE-AL files into FASTA format using Python!

Hello and welcome to this quick tutorial on how to use Python to convert SE-AL files into a more universal format such as FASTA!

Have you ever been the only one not using a Mac in your aDNA or bioinformatics group?

I am, and it’s super annoying when people send you alignment files that turn out to be in SE-AL format (SE-AL being an old alignment program that is not even usable past Mac OS X 10.7).  Even you Mac users may be interested in this tutorial if you have moved on from Snow Leopard, but you may have to adapt the code.

If a Mac user sent you an alignment file and you can’t figure out what format it is in, it’s probably SE-AL.  For reference, it looks something like this:

 Database={ 
    ID='Blablabla'; 
    Owner=null; 
    Name=null; 
    Description=null; 
    Flags=0; 
    Count=2; 
    { 
        { 
            ID='Blablabla'; 
            Owner=1; 
            Name=null; 
            Description=null; 
            Flags=0; 
            NumSites=876; 
            Type="Nucleotide"; 
            Features=null; 
            ColourMode=1; 
            LabelMode=0; 
            triplets=false; 
            inverse=true; 
            Count=68; 
            { 
                { 
                    ID='PSeq'; 
                    Owner=2; 
                    Name="Blablabla"; 
                    Description=""; 
                    Flags=0; 
                    Accession=""; 
                    Type="DNA"; 
                    Length=20; 
                    Sequence="ACTCGCTCGCTAGATAGATA"; 
                    GeneticCode=-1; 
                    CodeTable=null; 
                    Frame=1; 
                    Features=null; 
                    Parent=null; 
                    Complemented=false; 
                    Reversed=false; 
                } etc etc etc
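
If you would rather not eyeball the file, a rough check is to look for the two markers visible in the sample above.  This is only a sketch based on that one sample, not on any SE-AL specification: the function name looks_like_seal is made up, and real files may differ in spacing or layout.

```python
def looks_like_seal(text):
    """Heuristic guess: does this text resemble the SE-AL sample above?

    Assumption: the file starts with 'Database={' and holds at least one
    record whose ID is 'PSeq', exactly as in the sample.
    """
    head = text.lstrip()
    return head.startswith("Database={") and "ID='PSeq'" in text


# Made-up, shortened inputs for illustration
seal_like = " Database={ \n    ID='Blablabla'; \n    {\n        ID='PSeq'; \n    }\n}"
fasta_like = ">seq1\nACGT\n"

print(looks_like_seal(seal_like))
print(looks_like_seal(fasta_like))
```

The first call should report a match and the second should not, since a FASTA file has neither marker.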

Unfortunately, this format isn’t much use to the majority of genetics software, which simply responds to the file import with something like “WTF?”.

So, I spent an hour writing the Python script below, using only standard-library modules and Python 2.7.x.

To use the script, copy the code below and save it as a Python file (with a .py extension).  Open a terminal, navigate to the folder the script is in, and run it with the path to your SE-AL file as the only argument, e.g.

python python_script.py "/home/user/folder/se-al_file"

If it worked, you should find all the sequences written out in FASTA format to a file in the same folder as the input file, with the same base name and a .fasta extension.

If you’re interested in how the script works, simply read the “helpful” comments I have included.  But basically it makes heavy use of regular expressions to perform the magic!
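
As a taste of the regular-expression idea before the full script (which starts at the first #import comment further down), here is a minimal standalone sketch.  It is not part of the script itself: the record text is a shortened, made-up copy of the sample above, and the function name extract_fields is hypothetical.

```python
import re

# Hypothetical, shortened record modelled on the sample format above
SAMPLE = '''{
    ID='PSeq';
    Owner=2;
    Name="Blablabla";
    Type="DNA";
    Length=20;
    Sequence="ACTCGCTCGCTAGATAGATA";
}'''


def extract_fields(record):
    """Return (name, sequence) from one record, or None if either is missing."""
    # .*? is non-greedy, so each match stops at the first closing quote
    name = re.search(r'Name="(.*?)"', record)
    seq = re.search(r'Sequence="(.*?)"', record)
    if name is None or seq is None:
        return None
    return name.group(1), seq.group(1)


print(extract_fields(SAMPLE))
```

The parentheses in each pattern capture only the text between the quotes, so no separate step is needed to strip the quotes afterwards.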

#import the regular expressions, system, and operating system modules
import re, sys, os

#function that finds the sequence name in the block and returns it as a fasta header line
def seq_name(block):
    #capture whatever sits between the quotes of Name="..."
    name = re.search(r'Name="(.*?)"', block)
    return ">" + name.group(1)

#function that finds the sequence string in the block and returns it in fasta format with 80 chars per line
def seq_seq(block):
    #capture only the text between the quotes that follow the Sequence field
    seq = re.search(r'Seq[a-z].*?"(.*?)"', block).group(1)
    #wrap after the quotes are gone, so the first line is not one base short and no stray quote is left behind
    return re.sub("(.{80})", "\\1\n", seq).rstrip("\n")

#grab file name from terminal variable
file_name = sys.argv[1]

#open the input file using the file name and read it into a string variable
#(the with statement closes the file again once it has been read)
with open(file_name,"r") as input_file:
    s = input_file.read()

#split the file into blocks based on a '{' character
blocks = s.split("{")

#define a list to store (name, sequence) pairs; unlike a dictionary it keeps the sequences in file order and does not drop sequences that share a name
records = []

#for each block in the block array use only blocks that have the string 'PSeq' in them and call two functions to extract seq name and seq string, returning each in fasta format and adding them to the list
for block in blocks:
    if "PSeq" in block:
        records.append((seq_name(block), seq_seq(block)))

#build the output path: same folder and base name as the input file, with a .fasta extension
output_path = os.path.splitext(file_name)[0] + ".fasta"

#write each name and sequence to the fasta file, with a blank line after each record; the with statement closes the file when done
with open(output_path,'w') as output_file:
    for name, seq in records:
        output_file.write(name+"\n")
        output_file.write(seq+"\n\n")
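
The script ends with the last write call above.  As an optional, separate check that the conversion worked, the sketch below reads FASTA text back in and lists each sequence name with its length.  The function name fasta_summary and the example text are made up for illustration; the example mimics the layout the script writes (a header line, the sequence wrapped over one or more lines, then a blank line).

```python
def fasta_summary(text):
    """Return a list of (name, sequence_length) pairs from FASTA text."""
    records = []
    name, length = None, 0
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip the blank line written after each record
        if line.startswith(">"):
            if name is not None:
                records.append((name, length))
            name, length = line[1:], 0
        elif name is not None:
            length += len(line)  # wrapped sequence lines add up
    if name is not None:
        records.append((name, length))
    return records


# Made-up example in the layout the script writes
example = ">Blablabla\nACTCGCTCGCTAGATAGATA\n\n>Another\nACGT\nAC\n\n"
for name, length in fasta_summary(example):
    print("%s\t%d" % (name, length))
```

To check a real file, read its contents into a string first and pass that string to the function.  You can then compare the names and lengths against what you expect from the original alignment.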