More Data Wrangling with PowerShell

It’s estimated that data scientists spend 80% of their time converting data into a usable form.

Combining PowerShell’s ability for getting content from a web page and converting text to objects using the new “parse by example” functions is a potent cocktail for scrubbing and making data useable.

Let’s scrape this page for the Presidents of the United States, extract the name, plus starting and ending term. Then, we’ll used PowerShell’s Add-Member to calculate the TotalYears in office.

How to

Using Invoke-WebRequest cmdlet and the AllElements property gets content of the page. Piping the results to Where to find the the HTML element where the class attribute is equal to potus75Width (you can discover this by viewing the source of the web page).

The result of the Where is piped to ConvertFrom-String which uses the $template to parse the data by example. In the template you identify the Name, TermStart and TermEnd properties. Notice in the template you can tell ConvertFrom-String the data type you want the property to be cast to, here you cast TermStart and TermEnd to [int].

In the last step you pipe the objects that have been converted from text to ForEach so you can add a ScriptProperty named TotalYears. The Value of this property subtracts TermEnd from TermStart.

The PowerShell Code


You pulled down the html from the web page, searched for specific data on that page and passed the results to ConvertForm-String with a template showing an example of how you want the data extracted. For the final step, you enriched the data by adding a property that calculates the total years in office.


Next Steps

Now that the data has been converted into a usable form, you can easily export it to SQL, CSV, JSON or create an Excel spreadsheet with this PowerShell Excel Module. As a bonus, this PowerShell module can create Excel spreadsheets and Excel does not need to be installed on that machine. Perfect for server side operations.


Leave a Reply

Your email address will not be published. Required fields are marked *