Sunday, 04 September 2005

In an earlier post, I introduced a download manager in MSH.  (I have since updated it.  If you use the script, you might want to download the update.)  The major pain with it, though, is getting the URLs into the text file required by the download manager.  “Right-click, copy link location, paste” just doesn’t cut it for more than a few links.  To resolve this problem, we’ll write another script to parse URLs out of the locally saved HTML of a web page.

MSH:48 C:\Temp >$userAgent = "Monad Shell User"
MSH:49 C:\Temp >$wc = new-object System.Net.WebClient
MSH:50 C:\Temp >$wc.Headers.Add("user-agent", $userAgent)
MSH:51 C:\Temp >$wc.DownloadString("http://channel9.msdn.com") > temp.html
MSH:52 C:\Temp >parse-urls temp.html http://channel9.msdn.com/ wmv$
mms://wm.microsoft.com/ms/msnse/0508/25408/bill_staples_iis7_2005_MBR.wmv
http://download.microsoft.com/download/c/3/9/c39e98c3-03b7-4fa1-959a-8116e3ceb1e3/bill_staples_iis7_2005.wmv

Now, links are represented in HTML pages as anchor tags: usually something like

<a href="url">description</a>

However, many variations get in the way of such simple parsing: quote style and CSS decorations, to name a few.  This calls for some heavy pattern matching in text, a problem usually solved by regular expressions.  (For an overview of regular expressions in Monad, see my earlier post.)  In fact, almost all of the heavy lifting in this script is done through a single regular expression.
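To see the capture group in action before burying it in a script, here is a quick sketch against the .NET Regex class.  (The sample anchor tag below is made up for illustration.)

## A quick sketch: the capture group pulls the URL out of a
## messy anchor tag.  The sample markup is illustrative.
$anchorRegex = "<\s*a\s*[^>]*?href\s*=\s*[`"']*([^`"'>]+)[^>]*?>"
$sample = "<a class='nav' HREF=test1>Test1_Text</a>"
[Regex]::Match($sample, $anchorRegex, "IgnoreCase").Groups[1].Value
## Outputs: test1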

Now, regular expressions are notoriously fiddly things.  It’s hard enough to get them to work while you’re writing them – let alone fixing bugs in them weeks later.  The path out of this predicament lies in the tried and true (but surprisingly unpopular) technique called unit testing.

In unit testing, you write automated tests that exercise your code.  Ideally, you write the tests before you actually write the code, but any unit testing is better than none at all.

MSH:55 C:\Temp >parse-urls -unittest:$true
.................

<after breaking the script>

MSH:56 C:\Temp >parse-urls -unittest:$true
FAIL.  Expected: 1.  Actual: 0.  Test failed..
FAIL.  Expected: test1.  Actual: .  Test failed..
FAIL.  Expected: 1.  Actual: 0.  Test failed..
FAIL.  Expected: test1.  Actual: .  Test failed..
FAIL.  Expected: 1.  Actual: 0.  Test failed..
FAIL.  Expected: test1.  Actual: .  Test failed..
FAIL.  Expected: 1.  Actual: 0.  Test failed..
FAIL.  Expected: test1.  Actual: .  Test failed..
FAIL.  Expected: 1.  Actual: 0.  Test failed..
FAIL.  Expected: test1.  Actual: .  Test failed..
FAIL.  Expected: 1.  Actual: 0.  Test failed..
FAIL.  Expected: test1.  Actual: .  Test failed..
FAIL.  Expected: 1.  Actual: 0.  Test failed..
FAIL.  Expected: test1.  Actual: .  Test failed..
FAIL.  Expected: 2.  Actual: 0.  Test failed..
FAIL.  Expected: test1.  Actual: .  Test failed..
FAIL.  Expected: test2.  Actual: .  Test failed..

Although there are no bona fide unit testing frameworks for MSH scripts, the concept is bleedingly simple.  We can implement the basic requirements with only a few simple functions in our script:

## A simple assert function.  Verifies that $condition
## is true.  If not, outputs the specified error message.
function assert
    (
    [bool] $condition = $(throw "Please specify a condition"),
    [string] $message = "Test failed."
    )
{
    if(-not $condition)
    {
        write-host "FAIL. $message"
    }
    else
    {
        write-host -NoNewLine "."
    }
}

## A simple "assert equals" function.  Verifies that $expected
## is equal to $actual.  If not, outputs the specified error message.
function assertEquals
    (
    $expected = $(throw "Please specify the expected object"),
    $actual = $(throw "Please specify the actual object"),
    [string] $message = "Test failed."
    )
{
    if(-not ($expected -eq $actual))
    {
        write-host "FAIL.  Expected: $expected.  Actual: $actual.  $message."
    }
    else
    {
        write-host -NoNewLine "."
    }
}
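In practice, a passing assertion prints a dot, while a failing one prints the details.  For example (the values here are hypothetical):

## Example usage -- the values are illustrative
assertEquals 4 (2 + 2) "Addition is broken"   ## prints "."
assert $false "This one always fails"         ## prints "FAIL. This one always fails"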

Now, let’s see it in practice (along with the URL parser goodies I promised):

## parse-urls.msh
## Parse all of the URLs out of a given file.

param(
    ## The filename to parse
    [string] $filename,

    ## The URL from which you downloaded the page.
    ## For example, http://www.microsoft.com/index.html
    [string] $base,

    ## The Regular Expression pattern with which to filter
    ## the returned URLs
    [string] $pattern = ".*",

    ## Unit testing flag.
    [bool] $unitTest = $false
    )
    
## Defines the regular expression that will parse a URL
## out of an anchor tag.
$regex = "<\s*a\s*[^>]*?href\s*=\s*[`"']*([^`"'>]+)[^>]*?>"
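## Reading the pattern piece by piece (annotation added for clarity):
##   <\s*a\s*       the opening anchor tag
##   [^>]*?         lazily skip any attributes before href
##   href\s*=\s*    the href attribute, tolerating extra spaces
##   [`"']*         an optional opening quote, double or single
##   ([^`"'>]+)     capture the URL itself
##   [^>]*?>        skip anything else up to the closing bracket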

## The main function isn't a built-in function, but can make
## your script easier to read.  Since functions need to be defined
## before you use them, complicated scripts tend to have their function
## definitions obscure the main logic of the script.
##
## To combat this, we define a function, "main," and then call it
## (or dot-source it) at the very end of the script.
function main
{
   if(-not $unitTest)
   {
      parse-file
   }
   else
   {
      unittest
   }
}

## Parse the file for links
function parse-file
{
   if(-not $filename) { throw "Please specify a filename." }
   if(-not $base) { throw "Please specify a base URL." }

   ## Do some minimal source URL fixups, by switching backslashes to
   ## forward slashes
   $base = $base.Replace("\", "/")

   if($base.IndexOf("://") -lt 0)
   {
      throw ("Please specify a base URL in the form of " +
         "http://server/path_to_file/file.html")
   }

   ## Determine the server from which the file originated.  This will
   ## help us resolve links such as "/somefile.zip"
   $base = $base.Substring(0,$base.LastIndexOf("/") + 1)
   $baseSlash = $base.IndexOf("/", $base.IndexOf("://") + 3)
   $domain = $base.Substring(0, $baseSlash)
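   ## For example (illustrative values), given the base URL
   ## http://server/dir/file.html, $base becomes http://server/dir/
   ## and $domain becomes http://server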


   ## Put all of the file content into a big string, and
   ## get the regular expression matches
   $content = [String]::Join('', (get-content $filename))
   $contentMatches = get-matches $content $regex

   foreach($contentMatch in $contentMatches)
   {
      if(-not ($contentMatch -match $pattern)) { continue }

      $contentMatch = $contentMatch.Replace("\", "/")

      ## Hrefs may look like:
      ## ./file
      ## file
      ## ../../../file
      ## /file
      ## url
      ## We'll keep all of the relative paths, as they will resolve.
      ## We only need to resolve the ones pointing to the root.
      if($contentMatch.IndexOf("://") -gt 0)
      {
         $url = $contentMatch
      }
      elseif($contentMatch[0] -eq "/")
      {
         $url = "$domain$contentMatch"
      }
      else
      {
         $url = "$base$contentMatch"
         $url = $url.Replace("/./", "/")
      }

      $url
   }
}

## Return the text captured by the first group of each regular
## expression match in the given content.
function get-matches
    (
    [string] $content = "",
    [string] $regex = ""
    )
{
   $returnMatches = new-object System.Collections.ArrayList

   $resultingMatches = [Regex]::Matches($content, $regex, "IgnoreCase")
   foreach($match in $resultingMatches)
   {
      [void] $returnMatches.Add($match.Groups[1].Value.Trim())
   }

   return $returnMatches  
}

function unittest
{
   ## A well-formed HREF
   $matches = get-matches "<a href=`"test1`">Test1_Text</a>" $regex
   AssertEquals 1 $matches.Count
   AssertEquals "test1" $matches[0]

   ## Case insensitive
   $matches = get-matches "<a HREF=`"test1`">Test1_Text</a>" $regex
   AssertEquals 1 $matches.Count
   AssertEquals "test1" $matches[0]

   ## Non-quoted attribute
   $matches = get-matches "<a href=test1>Test1_Text</a>" $regex
   AssertEquals 1 $matches.Count
   AssertEquals "test1" $matches[0]

   ## Unbalanced quoted attribute
   $matches = get-matches "<a href=`"test1>Test1_Text</a>" $regex
   AssertEquals 1 $matches.Count
   AssertEquals "test1" $matches[0]

   ## Single ticks for quotes
   $matches = get-matches "<a href=`'test1`'>Test1_Text</a>" $regex
   AssertEquals 1 $matches.Count
   AssertEquals "test1" $matches[0]

   ## Lots of spaces
   $matches = get-matches `
    "<a     href =    `'test1`'    >Test1_Text</a>" $regex
   AssertEquals 1 $matches.Count
   AssertEquals "test1" $matches[0]

   ## Class names
   $matches = get-matches `
    "<a class=`"test`" href =`'test1`'>Test1_Text</a>" $regex
   AssertEquals 1 $matches.Count
   AssertEquals "test1" $matches[0]

   ## Two URLs
   $matches = get-matches `
    "<a href=test1>test1</a><a href=`'test2`'>test2</a>" $regex
   AssertEquals 2 $matches.Count
   AssertEquals "test1" $matches[0]
   AssertEquals "test2" $matches[1]

   write-host
}

## A simple assert function.  Verifies that $condition
## is true.  If not, outputs the specified error message.
function assert
    (
    [bool] $condition = $(throw "Please specify a condition"),
    [string] $message = "Test failed."
    )
{
    if(-not $condition)
    {
        write-host "FAIL. $message"
    }
    else
    {
        write-host -NoNewLine "."
    }
}

## A simple "assert equals" function.  Verifies that $expected
## is equal to $actual.  If not, outputs the specified error message.
function assertEquals
    (
    $expected = $(throw "Please specify the expected object"),
    $actual = $(throw "Please specify the actual object"),
    [string] $message = "Test failed."
    )
{
    if(-not ($expected -eq $actual))
    {
        write-host "FAIL.  Expected: $expected.  Actual: $actual.  $message."
    }
    else
    {
        write-host -NoNewLine "."
    }
}

main
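And to bring things full circle, the script’s output drops straight into the text file that the download manager expects.  A hypothetical follow-on to the session at the top of this post (the file names are illustrative):

## Save only the .wmv links for the download manager to pick up
parse-urls temp.html http://channel9.msdn.com/ "wmv$" > downloads.txt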

Now, here’s the great thing about unit testing.  Let’s say I find some HTML link code that this script should be able to parse, but doesn’t.  In that case, I simply write a new unit test for that code, and edit the regular expression to make the test pass.  If all of the tests continue to pass, then I can be sure that I didn’t break anything that used to work.
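For instance, a new test case slots in right alongside the existing ones in the unittest function.  (The sample markup here is hypothetical.)

## Hypothetical new test: attributes after the href
$matches = get-matches `
 "<a href=`"test1`" target=`"_blank`">Test1_Text</a>" $regex
AssertEquals 1 $matches.Count
AssertEquals "test1" $matches[0]

If the new test fails, I tweak $regex until the entire suite passes again.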

Now, go forth, and write high quality scripts!
