How to: Create Web Services That Parse the Contents of a Web Page

Web services created using ASP.NET provide an HTML parsing solution that enables developers to parse content from a remote HTML page and programmatically expose the resulting data. For a detailed explanation, see HTML Parsing by ASP.NET XML Web Services.

To specify an operation and input parameters

  1. Create a Web Services Description Language (WSDL) document, which is typically saved with the file name extension .wsdl. The document's content must consist of valid XML according to the WSDL schema. For a prototype, you can use a WSDL document dynamically generated for a Web service running on ASP.NET. Make a request with a ?wsdl argument appended to the Web service URL.

  2. Specify the elements that define the operation for each Web service method that parses HTML text. This step and the next one require knowledge of the WSDL format.

  3. If the parsing method takes input parameters, specify the elements that represent those parameters and associate them with the operation.
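As a rough sketch of steps 2 and 3, the following fragment declares one operation with a single input parameter for an HTTP-GET binding. The names GetHeading, HeadingHttpGet, and section are hypothetical, and http://tempuri.org/ is only a placeholder namespace. The <binding> and <service> elements are omitted here.

```xml
<?xml version="1.0"?>
<definitions xmlns:s="http://www.w3.org/2001/XMLSchema"
             xmlns:s0="http://tempuri.org/"
             targetNamespace="http://tempuri.org/"
             xmlns="http://schemas.xmlsoap.org/wsdl/">
  <types>
    <s:schema targetNamespace="http://tempuri.org/"
              elementFormDefault="qualified">
      <!-- Element referenced by the output message part below. -->
      <s:element name="string" type="s:string"/>
    </s:schema>
  </types>
  <!-- Input message: one part per input parameter (hypothetical name "section"). -->
  <message name="GetHeadingHttpGetIn">
    <part name="section" type="s:string"/>
  </message>
  <message name="GetHeadingHttpGetOut">
    <part name="Body" element="s0:string"/>
  </message>
  <!-- The operation ties the input and output messages to one method. -->
  <portType name="HeadingHttpGet">
    <operation name="GetHeading">
      <input message="s0:GetHeadingHttpGetIn"/>
      <output message="s0:GetHeadingHttpGetOut"/>
    </operation>
  </portType>
</definitions>
```

Each additional input parameter would be another <part> element in the input message.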

To specify the data returned from a parsed HTML page

  1. Add a namespace-qualified <text> XML element within the <output> element located at the XPath /definitions/binding/operation/output. The <operation> element represents the Web service method that retrieves parsed HTML.

  2. Within the <text> XML element, add a <match> XML element for each piece of data you want to return from the parsed HTML page.

  3. Apply attributes to the <match> element. The valid attributes are presented in a table under the topic HTML Parsing by ASP.NET XML Web Services.
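For illustration, the steps above might produce an <output> element like the following fragment of a <binding> operation. The match names and patterns are hypothetical. The repeats attribute with a value of * (return every match rather than only the first) is an assumption about the attribute table in HTML Parsing by ASP.NET XML Web Services; confirm it against that topic before relying on it.

```xml
<output>
  <!-- The text element must be qualified with the text-matching namespace. -->
  <text xmlns="http://microsoft.com/wsdl/mime/textMatching/">
    <!-- Returns the text captured between TITLE> and the next < character. -->
    <match name="Title" pattern="TITLE&gt;(.*?)&lt;"/>
    <!-- Assumed: repeats="*" returns all H2 headings instead of the first one. -->
    <match name="Headings" pattern="H2&gt;(.*?)&lt;" repeats="*"/>
  </text>
</output>
```

The angle brackets inside each pattern are escaped as &gt; and &lt; so that the attribute values remain valid XML.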

To generate client proxy code for the Web service

  • Run the Wsdl.exe tool from the .NET Framework SDK. Pass the WSDL file you created as an input.

Example

The following code example is a simple Web page sample containing <TITLE> and <H1> tags.

<HTML>
 <HEAD>
  <TITLE>Sample Title</TITLE>
 </HEAD>
 <BODY>
    <H1>Some Heading Text</H1>
 </BODY>
</HTML>

The following code example is a service description that parses the HTML page and extracts the text within the <TITLE> and <H1> tags. In the code example, a TestHeaders method is defined for the GetTitleHttpGet binding. The TestHeaders method defines two pieces of data that can be returned from the parsed HTML page in <match> XML elements: Title and H1, which parse the contents of the <TITLE> and <H1> tags, respectively.

<?xml version="1.0"?>
<definitions xmlns:s="http://www.w3.org/2001/XMLSchema"
             xmlns:http="http://schemas.xmlsoap.org/wsdl/http/"
             xmlns:mime="http://schemas.xmlsoap.org/wsdl/mime/"
             xmlns:soapenc="http://schemas.xmlsoap.org/soap/encoding/"
             xmlns:soap="http://schemas.xmlsoap.org/wsdl/soap/"
             xmlns:s0="http://tempuri.org/"
             targetNamespace="http://tempuri.org/"
             xmlns="http://schemas.xmlsoap.org/wsdl/">
  <types>
    <s:schema targetNamespace="http://tempuri.org/"
              attributeFormDefault="qualified"
              elementFormDefault="qualified">
      <s:element name="TestHeaders">
        <s:complexType derivedBy="restriction"/>
      </s:element>
      <s:element name="TestHeadersResult">
        <s:complexType derivedBy="restriction">
          <s:all>
            <s:element name="result" type="s:string" nullable="true"/>
          </s:all>
        </s:complexType>
      </s:element>
      <s:element name="string" type="s:string" nullable="true"/>
    </s:schema>
  </types>
  <message name="TestHeadersHttpGetIn"/>
  <message name="TestHeadersHttpGetOut">
    <part name="Body" element="s0:string"/>
  </message>
  <portType name="GetTitleHttpGet">
    <operation name="TestHeaders">
      <input message="s0:TestHeadersHttpGetIn"/>
      <output message="s0:TestHeadersHttpGetOut"/>
    </operation>
  </portType>
  <binding name="GetTitleHttpGet" type="s0:GetTitleHttpGet">
    <http:binding verb="GET"/>
    <operation name="TestHeaders">
      <http:operation location="MatchServer.html"/>
      <input>
        <http:urlEncoded/>
      </input>
      <output>
        <text xmlns="http://microsoft.com/wsdl/mime/textMatching/">
          <match name='Title' pattern='TITLE&gt;(.*?)&lt;'/>
          <match name='H1' pattern='H1&gt;(.*?)&lt;'/>
        </text>
      </output>
    </operation>
  </binding>
  <service name="GetTitle">
    <port name="GetTitleHttpGet" binding="s0:GetTitleHttpGet">
      <http:address location="http://localhost" />
    </port>
  </service>
</definitions>

The following code example is a portion of the proxy class generated by Wsdl.exe for the previous service description, shown first in Visual Basic and then in C#.

' GetTitle is the name of the proxy class.
Public Class GetTitle
  Inherits HttpGetClientProtocol
  Public Function TestHeaders() As TestHeadersMatches
     Return CType(Me.Invoke("TestHeaders", (Me.Url + _
          "/MatchServer.html"), New Object(-1) {}),TestHeadersMatches)
  End Function
End Class
Public Class TestHeadersMatches
    Public Title As String
    Public H1 As String
End Class

// GetTitle is the name of the proxy class.
public class GetTitle : HttpGetClientProtocol
{
    public TestHeadersMatches TestHeaders()
    {
        return ((TestHeadersMatches)(this.Invoke("TestHeaders",
                 (this.Url + "/MatchServer.html"), new object[0])));
    }
}
public class TestHeadersMatches 
{
    public string Title;
    public string H1;
}

See Also

Reference

Web Services Description Language Tool (Wsdl.exe)
MatchAttribute

Concepts

HTML Parsing by ASP.NET XML Web Services

Other Resources

.NET Framework Regular Expressions
XML Web Services Using ASP.NET