Consuming HTML Tables with Powershell

Occasionally, you may find your self in a situation where a website has all the data you need, but the data is in one or more HTML tables and not really consumable. You could try copy and paste it into Microsoft Excel, but really you want the data in Powershell.

For my example, I am using the Microsoft Product Life Cycle search page. This page has all the data we want, but it is an HTML table. How can we get this pragmatically?

Lets start with Invoke-WebRequest.

$page = Invoke-WebRequest $URL

This will nicely get us the entire page and store it in an object. Our data is in there somewhere.
We know our data is in a table, so lets get all of the tables:

$tables = $page.ParsedHtml.body.getElementsByTagName('Table')

Since we are looking for HTML tables that have data, we’ll get only tables that have cells.

$tableswithcells = $tables | Where{$_.cells}

Now it is simply a matter of parsing each row of cells in each table and storing the data in an object.
Assuming the first row of our table is a header, we want to keep track of the header values since this will be the property names of our object.

if($rownum -eq 0){
     ForEach($cell in $row.cells){
          $arrTableHeader += $cell.InnerText.Trim()
     }
}

For each row in the table (except the first row), we need to parse each cell. If the header is not null, we’ll add it as by the header name otherwise, we’ll add it as an index.

$cellnum = 0
$hashRow = @{}
ForEach($cell in $row.cells){
         $strHeader = $arrTableHeader[$cellNum]
         If($strHeader){
             $hashRow.Add($strHeader,$cell.innertext)
         }else{
             #If the header is null store it by cell number instead
             $hashRow.Add($cellnum,$cell.innertext)
         }
         $cellnum++
}
#Save the row as a custom ps object
$objRow = New-object -TypeName PSCustomObject -Property $hashRow
$arrTable += $objRow

In the end, we have a custom object for our page that contains custom objects for each table.
Wrapping this all up into a function:

Function Get-HTMLTables{
 <#
        .SYNOPSIS   Gets all the text in HTML tables on a webpage and returns it as an object of arrays
        .DESCRIPTION
            This function treats 
                1. table cells as properties of row objects
                2. Rows as custom objects with cells as properties
                3. Tables as arrays of rows
                4. Pages as custom objects with numbered tables as properties.
            Returns
                -Custom object with numbered properties
                -Each numbered property represents a table number from the top of the page down
                -Each numbered property contains an array of row custom objects
                -Each object in the array has properties representing each cell in the row.
        .PARAMETER URL
            The URL to look for HTML tables in
        .PARAMETER FirstRowHeader
            Whether or not to treat the first row of the table as a header
        .EXAMPLE
            $objPage = Get-HTMLTables -URL 'https://support.microsoft.com/en-us/lifecycle/search?sort=PN&alpha=windows%2010&Filter=FilterNO' -FirstRowHeader $true
        .EXAMPLE
            $objPage.21 | select -first 1
                           
            Extended Support End Date     : 10/14/2025
            Lifecycle Start Date          : 7/29/2015
            Products Released             : Visual Studio Express 2015 for Windows 10
            Mainstream Support End Date   : 10/13/2020
            Service Pack Support End Date : 
            Notes                         : 
        .EXAMPLE
            $objPage.21 | Where{$_.('Products Released') -match 'Education'}
            
            Extended Support End Date     : 10/14/2025
            Lifecycle Start Date          : 7/29/2015
            Products Released             : Windows 10 Education, released in July 2015
            Mainstream Support End Date   : 10/13/2020
            Service Pack Support End Date : 
            Notes : • Updates are cumulative, with each update built upon all of the updates that preceded it. A device needs to 
                    install the latest update to remain supported. 
                    • Updates may include new features, fixes (security and/or non-security), or a combination of both. Not all 
                    features in an update will work on all devices. 
                    • A device may not be able to receive updates if the device hardware is incompatible, lacking current 
                    drivers, or otherwise outside of the Original Equipment Manufacturer’s (“OEM”) support period. 
                    • Update availability may vary, for example by country, region, network connectivity, mobile operator (e.g., 
                    for cellular-capable devices), or hardware capabilities (including, e.g., free disk space).  
#>
Param(
[uri]$URL = 'https://support.microsoft.com/en-us/lifecycle/search?sort=PN&alpha=windows%2010&Filter=FilterNO',
[boolean]$firstRowHeader = $false
)
   

#Get the webpage
$page = Invoke-WebRequest $URL

#Filter out only the tables
$tables = $page.ParsedHtml.body.getElementsByTagName('Table')

#Get only the tables that have cells
$tableswithcells = $tables | Where{$_.cells}
$hashPage = @{}
$tablecount = 0

#ForEach table
ForEach($table in $tableswithcells){
    $arrTable = @()
    $rownum = 0
    $arrTableHeader = @()
    #Get all the rows in the tables
    ForEach($row in $table.rows){
        #Treat the first row as a header
        if($rownum -eq 0 -and $firstRowHeader){
            ForEach($cell in $row.cells){
                $arrTableHeader += $cell.InnerText.Trim()
            }
            #If not the first row, but using headers, store the value by header name
        }elseIf($firstRowHeader){
            $cellnum = 0
            $hashRow = @{}
            ForEach($cell in $row.cells){
                $strHeader = $arrTableHeader[$cellNum]
                If($strHeader){
                    $hashRow.Add($strHeader,$cell.innertext)
                }else{
                    #If the header is null store it by cell number instead
                    $hashRow.Add($cellnum,$cell.innertext)
                }
                $cellnum++
            }
            #Save the row as a custom ps object
            $objRow = New-object -TypeName PSCustomObject -Property $hashRow
            $arrTable += $objRow
            #if not the first row and not using headers, store the value by cell index
        }else{
            $cellnum = 0
            $hashRow = @{}
            ForEach($cell in $row.cells){
                $hashRow.Add($cellnum,$cell.innertext)
                $cellnum++
            }
            #Store the row as a custom object
            $objRow = New-object -TypeName PSCustomObject -Property $hashRow

            #Add the row to the array of rows
            $arrTable += $objRow
        }
        $rownum++
    }
    #Add the tables to the hashtable of tables
    $hashPage.Add($tablecount,$arrTable)
    $tablecount++
}
$objPage = New-object -TypeName PSCustomObject -Property $hashPage
Return $objPage
}

Running this function:

$objPage = Get-HTMLTables -firstRowHeader $true -url 'https://support.microsoft.com/en-us/lifecycle/search?sort=PN&alpha=windows%2010&Filter=FilterNO'

Keeping mind that there will be multiple tables on the page and only of these tables has the data we’re looking for. At this point, you’ll want to filter the output based on the website you are parsing. In my specific case, the table I’m looking for is table #21.

$objpage.21 | Where{$_.('Products Released') -match 'Education'}

<# Output:
Extended Support End Date     : 10/14/2025
Lifecycle Start Date          : 7/29/2015
Products Released             : Windows 10 Education, released in July 2015
Mainstream Support End Date   : 10/13/2020
Service Pack Support End Date : 
Notes                         : • Updates are cumulative, with each update built upon all of the updates that preceded it. A device needs to 
                                install the latest update to remain supported. 
                                • Updates may include new features, fixes (security and/or non-security), or a combination of both. Not all 
                                features in an update will work on all devices. 
                                • A device may not be able to receive updates if the device hardware is incompatible, lacking current 
                                drivers, or otherwise outside of the Original Equipment Manufacturer’s (“OEM”) support period. 
                                • Update availability may vary, for example by country, region, network connectivity, mobile operator (e.g., 
                                for cellular-capable devices), or hardware capabilities (including, e.g., free disk space).  
#>
Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s