PowerShell, An Exercise in Species Barcoding

by Doug Finke on February 17, 2009

in Peter Norvig, PowerShell

Peter Norvig, Director of Research at Google, blogged An Exercise in Species Barcoding. What I like about the post is it touches on Bio Informatics and Mr. Norvig provides the Python scripts which does several metrics on the DNA sequences. Also of interest is his choice points when to drop into Java code.

I decided this would be a good challenge in PowerShell. Here are the results of the aggregate statistics from the PowerShell run. I didn’t build the Multi-named genomes one yet. The stats match except for the number of distinct genomes, Mr. Norvig reports 801. I need to investigate.

Aggregate Statistics

Number of genomes: 1248 (291 distinct)
Genome lengths: min=659, max=659
Character counts: -:1531 A:249466 C:126358 G:120017 N:1505 T:323555

In a future post I will show the Levenshtein distance metric implementation. This measures the difference between two sequences of characters.

PowerShell Code

Here is the data set of 1248 barcode sequences, all of them Lepidoptera (butterflies and moths) from Australia.

Function Get-Genomes {
    param($fileName = "$pwd\barcodingdata.txt")
    
    $list = [io.File]::ReadAllText( $fileName  ) -split '>(.*?)\r([^\r]*)\r*' 
    
    while ($list) {
        $obj = "" | Select Name, Species, Genome        
        $null, $obj.Name, $obj.Genome, $list = $list        
        $obj.Species = $obj.Name.Split('|')[-1].Trim()        
        $obj
    }
}
 
Function Get-CharacterCounts {
    param($target)
    
    $h=@{}
    ForEach($g in $target) {
        ForEach($c in $g.Genome.ToCharArray()) {
            $h[$c]+=1    
        }
    }
    $h
}
 
Function Get-GenomeReport {
    param($genomes=(Get-Genomes))
    
    "Number of genomes: $($genomes.Count) ($(($genomes | sort species -unique).Count) distinct)"
    $m = $genomes | % {$_.genome.length} | Measure-Object -Minimum -Maximum
    "Genome lengths: min=$($m.Minimum), max=$($m.Maximum)"
    write-host -NoNewline "Character counts: "
    (Get-CharacterCounts $genomes).GetEnumerator() | sort key | % { Write-Host -NoNewline "$($_.key):$($_.value) "}
    write-host
}
 
Get-GenomeReport

{ 1 trackback }

Speeling problems? Trie F#. » Lab49 Blog
03.09.09 at 11:13 am

{ 2 comments… read them below or add one }

1 Kalani 02.20.09 at 10:38 pm

Levenshtein eh? I didn’t know that there was a particular person associated with the edit-distance function.

2 Doug Finke 02.20.09 at 10:54 pm

Good point. I lifted that from Peter Norvig’s post. Here is the Wikipedia for Edit Distance

Hamming distance
Levenshtein distance
Damerau–Levenshtein distance
Jaro-Winkler distance
Wagner-Fischer edit distance
Ukkonen
Hirschberg’s algorithm

Leave a Comment

You can use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>