Full-Text Search: PowerShell meet Lucene

Bruce Payette, co-designer of the PowerShell language, gave a talk at the PowerShell Summit on integrating Lucene full-text search with PowerShell.

Refactoring

So I did a little refactoring and wrapped it in a GUI. It’s all PowerShell.

In Action

It indexed the content of 2500+ files in ~2 seconds.

Index and Search

Type in the directory to be searched, including a filter (for example, c:\posh\*.ps1), and press Enter. It will recursively search the directory for all ps1 files and index their contents, keeping the index in memory. You can also search multiple directories with different filters by separating them with commas, e.g. c:\temp\*.cs,c:\test\*.ts,c:\arm\*.json
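To make the directory-plus-filter input concrete, here is a minimal sketch of how such a comma-separated list could be split and turned into a recursive file listing. The function name Get-FilesFromSpec is hypothetical and is not taken from the actual script; it only illustrates the idea.

```powershell
# Hypothetical helper (not from the original script): expands a spec such as
# 'c:\temp\*.cs,c:\test\*.ts' into the matching files, searching recursively.
function Get-FilesFromSpec {
    param(
        [Parameter(Mandatory)]
        [string]$Spec
    )

    foreach ($entry in ($Spec -split ',')) {
        $entry = $entry.Trim()
        if (-not $entry) { continue }

        # 'c:\posh\*.ps1' -> directory 'c:\posh' and filter '*.ps1'
        $dir    = Split-Path -Path $entry -Parent
        $filter = Split-Path -Path $entry -Leaf
        if (-not $dir) { $dir = '.' }   # a bare filter means the current directory

        Get-ChildItem -Path $dir -Filter $filter -Recurse -File
    }
}
```

For example, Get-FilesFromSpec 'c:\temp\*.cs,c:\test\*.ts' would emit the FileInfo objects for both directories, ready to be handed to an indexer.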

Then you can search for a term across everything that was indexed by typing it in the Query box and pressing Enter.
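If you are curious what "index the contents, then query a term" means underneath, the sketch below builds a toy in-memory inverted index with a plain hashtable. To be clear, this is not Lucene and not the code from the repo; it is a simplified stand-in with no ranking, analyzers or phrase queries, and the function names New-TextIndex and Search-TextIndex are made up for illustration.

```powershell
# Toy stand-in for an in-memory index (NOT Lucene): maps each lower-cased
# word to the list of files that contain it.
function New-TextIndex {
    param(
        [Parameter(Mandatory)]
        [System.IO.FileInfo[]]$File
    )

    $index = @{}
    foreach ($f in $File) {
        $text = [System.IO.File]::ReadAllText($f.FullName).ToLowerInvariant()

        # Collect the distinct words in this file
        $terms = New-Object 'System.Collections.Generic.HashSet[string]'
        foreach ($m in [regex]::Matches($text, '\w+')) {
            [void]$terms.Add($m.Value)
        }

        # Record the file under every word it contains
        foreach ($term in $terms) {
            if (-not $index.ContainsKey($term)) {
                $index[$term] = New-Object 'System.Collections.Generic.List[string]'
            }
            $index[$term].Add($f.FullName)
        }
    }
    $index
}

# Looks up a single word and returns the paths of the files containing it.
function Search-TextIndex {
    param(
        [Parameter(Mandatory)]
        [hashtable]$Index,
        [Parameter(Mandatory)]
        [string]$Term
    )
    $Index[$Term.ToLowerInvariant()]
}
```

Usage would look like $index = New-TextIndex -File (ls c:\posh -r *.ps1) followed by Search-TextIndex -Index $index -Term param. Lucene does the same job with far better speed and query support, which is the reason to use it.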

What is Lucene?

Apache Lucene is a free and open-source information retrieval software library, originally written in Java by Doug Cutting. It is supported by the Apache Software Foundation and is released under the Apache Software License.

Who uses Lucene

  • Apache Solr (used by OMS Operational Insights)
  • Elasticsearch (everybody uses this)
  • CiteSeerX
  • Apple
  • 7digital (digital media)
  • Comcast
  • Disney

On GitHub

Grab it all from my GitHub Repo.

Love the quick one-off automation you can do in PowerShell

I use markdown for lots of things: note taking, blog posts, README files for my GitHub repos and more.

Typically I launch MarkdownPad, start typing and then do a File|Save, navigate to the directory where I want it and save it.

Too Much Work

That workflow opens itself up to lots of missteps: hand-eye coordination problems, fat-fingering the directory where I want it saved, etc.

So, let’s whittle that down to 10 characters (fewer if you use tab completion) plus the name of the file.

New-MDFile blogEntry

The function adds the '.md' extension, creates the file with the proper encoding, and finally does an Invoke-Item on the file name so it launches MarkdownPad with the file ready for editing.
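A minimal sketch of what such a function could look like follows. It is a reconstruction from the description above, not the original code. The UTF-8 encoding is an assumption (the text only says "proper encoding"), and the -Path and -NoLaunch parameters are hypothetical additions that make it easier to try out without opening an editor.

```powershell
# Sketch only: reconstructs the described behavior under stated assumptions.
function New-MDFile {
    param(
        [Parameter(Mandatory)]
        [string]$Name,

        # Hypothetical: target directory, defaults to the current location
        [string]$Path = (Get-Location).Path,

        # Hypothetical: skip launching the associated editor
        [switch]$NoLaunch
    )

    # Add the '.md' extension
    $file = Join-Path $Path "$Name.md"

    # Create the file only if it does not exist yet (UTF-8 is an assumption)
    if (-not (Test-Path $file)) {
        Set-Content -Path $file -Value '' -Encoding UTF8
    }

    # Open the file in whatever application is registered for .md files
    if (-not $NoLaunch) {
        Invoke-Item $file
    }

    Get-Item $file
}
```

Invoke-Item relies on the file association, so MarkdownPad opens only if it is the registered handler for .md files on your machine.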

Select-String

On another note, I’ve been working with TypeScript recently. Often, I need to find text across multiple files. The TypeScript files are organized across directories and subdirectories.

Rather than repeatedly typing ls . -r *.ts and piping it to Select-String with the pattern you’re looking for (too much work), you can create a function that combines these operations into one. The search string is optional; fts returns a list of TypeScript files if it is not specified.

fts showInformationMessage
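Here is a minimal sketch of how fts could be written, based on the description above. The original function body is not shown in this text, so treat this as one possible version; the -Path parameter is a hypothetical addition so the search root does not have to be the current directory.

```powershell
# Sketch of fts ("find TypeScript"): lists *.ts files recursively and,
# when a pattern is given, searches their contents with Select-String.
function fts {
    param(
        # Optional search pattern (Select-String treats it as a regex)
        [string]$Pattern,

        # Hypothetical: root directory to search, defaults to the current one
        [string]$Path = '.'
    )

    $files = Get-ChildItem -Path $Path -Recurse -Filter *.ts -File

    if ($Pattern) {
        $files | Select-String -Pattern $Pattern
    }
    else {
        $files
    }
}
```

Because Select-String interprets the pattern as a regular expression, characters such as ( or . need escaping if you want them matched literally.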

Question: How would you adapt this to work with PowerShell files? C# files? Other files?

Harvesting Web Data using PowerShell

Grab the PowerShell and then you can type Get-Top100Songs 1983, passing in the year you’d like to see, to get the song list.

The PowerShell

Harvesting the data is a two-step process. The first step is using Invoke-WebRequest to grab the HTML we want to wrangle. You hit the pop culture website to get the top 100 songs for the year you want. The object returned by Invoke-WebRequest has an AllElements property which returns, well, all the elements on that page. Piping this list of elements to Where, we can find only the elements that have a TagName of tbody. The innerHTML property returns a string of HTML.

Now you can shift gears to the second and final step: parsing the actual HTML using example-based parsing. The $t variable contains the template for parsing the HTML. This template is passed to ConvertFrom-String, new in PowerShell v5.0. The songs returned from the web site are structured. Each song sits in a <TR></TR>, and the position, group and song sit in <TD> tags. Using the ConvertFrom-String template language, you mark up the data so the example can be used to parse the HTML. Here, the example is marked with {pos:}, {group:} and {song:}. The example template also includes records that differ from one another. This lets ConvertFrom-String properly construct a domain specific language to parse the data.
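The first step can be sketched roughly as below. The function name Get-TableBodyHtml and its -Url parameter are hypothetical, since the address of the site is not given here. It also assumes Windows PowerShell, where Invoke-WebRequest parses the page and exposes AllElements; newer cross-platform PowerShell versions do not provide that property.

```powershell
# Sketch of step one: fetch a page and return the inner HTML of each <tbody>.
# Assumes Windows PowerShell, where the response exposes AllElements.
function Get-TableBodyHtml {
    param(
        # Hypothetical parameter: the source does not state the site's URL
        [Parameter(Mandatory)]
        [string]$Url
    )

    $response = Invoke-WebRequest -Uri $Url

    $response.AllElements |
        Where-Object { $_.tagName -eq 'tbody' } |
        ForEach-Object { $_.innerHTML }
}
```

The strings this returns are the raw table rows, which is the input the second step hands to ConvertFrom-String.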

PowerShell ConvertFrom-String: Serious Text wrangling

There’s a new PowerShell cmdlet, ConvertFrom-String, released with PowerShell v5.0. There are a bunch of write-ups on using this cmdlet, and I want to show how it makes quick work of HTML source.

HTML Source

Sometimes you’ll get HTML that looks like the snippet below (remember, it could be hundreds and hundreds of lines of HTML), so editing or transforming it by hand would take quite some time.

The Transform

Let’s say we wanted to go from the HTML above, to this:

I’ve written code (or used a macro recorder in a text editor) to find the first ‘>’, delete the text to the left, find the ‘(’, grab the text I want, etc.

The challenge is, not all the people I work with know how to do this. Plus, there are many other (mundane) text reformatting tasks that people go through every day.

Enter ConvertFrom-String

The key here is the $template, starting on line 17. I’m using ConvertFrom-String to do example-driven parsing. The template provides the example (hints to ConvertFrom-String on what I want extracted).

I put curly braces around the data I want to extract and give each piece a name, Item and Count. The * tells ConvertFrom-String this should result in multiple records.

The data is piped to ConvertFrom-String, parsed and then piped to ForEach, which does the final transform.

That is slick and easy.

Note: I’m providing the data and template in the code. Both the data and template can be in separate external files, so the script could be run to do transforms over many inputs.
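The sketch below shows the shape of that pipeline. The HTML lines, the tag names and the final output format are all made-up sample values, since the original snippet and result are not reproduced in this text. It assumes Windows PowerShell 5.x, where ConvertFrom-String is available, and how well the parser generalizes depends on the examples you give it.

```powershell
# Hypothetical sample data standing in for the original HTML snippet
$data = @'
<li><a href="/tags/powershell">PowerShell (42)</a></li>
<li><a href="/tags/lucene">Lucene (7)</a></li>
<li><a href="/tags/wpf">WPF (13)</a></li>
<li><a href="/tags/typescript">TypeScript (5)</a></li>
'@

# Example template: curly braces mark the values to extract and name them.
# The * on Item signals that each occurrence starts a new record.
$template = @'
<li><a href="/tags/powershell">{Item*:PowerShell} ({Count:42})</a></li>
<li><a href="/tags/lucene">{Item*:Lucene} ({Count:7})</a></li>
'@

# Parse by example, then reshape each record (output format is illustrative)
$data |
    ConvertFrom-String -TemplateContent $template |
    ForEach-Object { '{0} = {1}' -f $_.Item, $_.Count }
```

If the template lives in its own file, the -TemplateFile parameter can be used in place of -TemplateContent.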

Check Out ConvertFrom-String Buddy

I created a GUI (using PowerShell and WPF); you can get the script HERE.

It lets you quickly and easily experiment with ConvertFrom-String.

Paste the data you want to transform in the data text box (on the left). Start typing the example template in the template text box (on the right). As you type, you’ll immediately see results in the result text box.

Plus, it generates the PowerShell code as you go. You can copy that to the clipboard and save it as a script for later.

PowerShell Show-Map

This PowerShell function launches a map in the browser using an address from the command line or the clipboard.

Plus, it checks which version of PowerShell is running and, if it is v5 or later, uses the new Get-Clipboard cmdlet. If it is an older version of PowerShell, it uses .NET to get the text from the clipboard.

The default map is Google. Specify the -UseBing switch parameter to launch the map using Bing.
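A minimal sketch of how such a function could be put together is shown below. It is not the original Show-Map source. The Get-MapUrl helper is a hypothetical split-out that makes the URL building easy to check on its own, and the two URL formats are my assumption of reasonable search URLs for each service.

```powershell
# Hypothetical helper: builds the map URL for an address.
function Get-MapUrl {
    param(
        [Parameter(Mandatory)]
        [string]$Address,
        [switch]$UseBing
    )

    $query = [uri]::EscapeDataString($Address)

    if ($UseBing) {
        "https://www.bing.com/maps?q=$query"
    }
    else {
        "https://www.google.com/maps/search/?api=1&query=$query"
    }
}

# Sketch of Show-Map: takes the address from the command line or the clipboard.
function Show-Map {
    param(
        [string]$Address,
        [switch]$UseBing
    )

    if (-not $Address) {
        if ($PSVersionTable.PSVersion.Major -ge 5) {
            # v5 and later: use the built-in cmdlet
            $Address = Get-Clipboard
        }
        else {
            # Older versions: fall back to .NET
            Add-Type -AssemblyName System.Windows.Forms
            $Address = [System.Windows.Forms.Clipboard]::GetText()
        }
    }

    if (-not $Address) {
        throw 'No address given and the clipboard is empty.'
    }

    # Opens the URL in the default browser
    Start-Process (Get-MapUrl -Address $Address -UseBing:$UseBing)
}
```

Typical use would be Show-Map '1 Main St, Springfield' or, after copying an address, simply Show-Map -UseBing.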