PowerShell – Find the K most common words in a file

by Doug Finke on February 7, 2009

in PowerShell

Jon Bentley says Doug McIlroy did it in six lines of code in a UNIX shell language.

Bentley, "Little languages", Communications of the ACM, 29(8):711-21, August 1986

Here are 19 lines of PowerShell:

Function Get-Top6Words {
    param ($fileName="$pwd\big.txt")
    
    Function train($text)
    {
        $h = @{}
        $text = [string]::join(' ', $text)
        
        ForEach ($word in [regex]::split($text.ToLower(), '\W+') ) {
            if ($word) { $h[$word] += 1 }  # skip empty strings left by leading/trailing separators
        }
                    
        $h
    }
 
    (train ([System.IO.File]::ReadAllLines($fileName))).GetEnumerator() | 
        Sort-Object value -Descending | 
        Select-Object -First 6
}
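
The title promises the K most common words, but the function above hardcodes 6. Here is a minimal sketch of one way to make K a parameter. The name Get-TopWords and the -Path and -Count parameters are my own assumptions and not part of the original post. It also uses Group-Object for the counting, as suggested in the comments below, and needs PowerShell 3.0 or later for Get-Content -Raw.

```powershell
# Hypothetical generalisation of Get-Top6Words; -Count plays the role of K.
function Get-TopWords {
    param (
        [string] $Path  = (Join-Path $pwd 'big.txt'),
        [int]    $Count = 6
    )

    # Read the whole file as a single lower-cased string.
    $text = (Get-Content -LiteralPath $Path -Raw).ToLower()

    # Split on runs of non-word characters, drop the empty strings
    # that appear at the edges, and let Group-Object do the counting.
    [regex]::Split($text, '\W+') |
        Where-Object { $_ } |
        Group-Object -NoElement |
        Sort-Object Count -Descending |
        Select-Object -First $Count
}
```

Something like Get-TopWords -Path .\big.txt -Count 10 should then list the ten most frequent words with their counts. The order of words with equal counts is not guaranteed.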

By default, the function reads a big.txt file from the current directory.


{ 2 comments }

Shay Levy 02.07.09 at 2:11 pm

How about one line :)

[regex]::split([io.file]::readAllText($fileName).ToLower(),'\W+') | group -NoElement | sort count -desc | select -first 6

Doug Finke 02.07.09 at 4:03 pm

Thank you gentlemen, great updates.

I reused the code from my spelling corrector, written in vanilla PowerShell, which I modelled on the Python code in Peter Norvig's "How to Write a Spelling Corrector", the approach they use at Google.

I didn’t shortcut my thinking.
