HTML Parsing by ASP.NET XML Web Services
The Web today exposes an immense quantity of information. Unfortunately, the majority of this data is easily interpreted only by human eyes reading it from a browser. Web services created using ASP.NET help improve this situation by providing an HTML parsing solution that enables developers to parse content from a remote HTML page and programmatically expose the resulting data. Once permission is obtained from the publisher of the Web site content, and assuming the layout of the content does not change, HTML parsing can then be used to expose Web services that clients can leverage. For more information about HTML parsing, see How to: Create Web Services That Parse the Contents of a Web Page.
Building a Web service that parses the contents of a Web page uses a different model than building a typical Web service. A Web service that parses an HTML page is implemented through the creation of a service description, which is an XML document in the Web Services Description Language (WSDL). Within the service description, XML elements are added to specify both the input parameters and the data to return from the parsed HTML page.
Input parameters can be passed to the Web server if the HTML page being parsed accepts parameters that affect the contents of the returned HTML page.
Most of the implementation work lies in specifying the data returned from the parsed HTML page, because that is where the instructions for parsing the HTML content are given. To add these XML elements, and thus build a Web service that parses an HTML page, a developer must understand the layout of an XML document written in WSDL. For details on WSDL, see the WSDL specification at the W3C Web site (www.w3.org/TR/wsdl).
The data to return from a parsed HTML page is expressed within the service description using a series of XML elements that contain regular expressions to parse specific pieces of data while providing a name for each piece of data. The actual .NET Framework regular expression appears in a match XML element. Regular expressions provide an extensive pattern-matching notation that allows you to quickly parse large amounts of text to find specific character patterns. For details regarding the .NET Framework regular expression syntax, see .NET Framework Regular Expressions.
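As a rough illustration of the description above, the following fragment sketches how match elements might sit inside the output of an operation in a service description. It is a sketch under stated assumptions, not a complete or authoritative service description: the binding and operation names, the s0 prefix, the page location /MatchServer.html, and the two patterns are all placeholders, and the text-matching namespace shown should be checked against How to: Create Web Services That Parse the Contents of a Web Page.

```xml
<!-- Fragment only. The s0 prefix is assumed to be bound to the service's
     target namespace elsewhere in the service description. -->
<binding name="GetTitleHttpGet" type="s0:GetTitleHttpGet"
         xmlns="http://schemas.xmlsoap.org/wsdl/"
         xmlns:http="http://schemas.xmlsoap.org/wsdl/http/">
  <http:binding verb="GET"/>
  <operation name="GetTitle">
    <!-- Hypothetical location of the HTML page to parse -->
    <http:operation location="/MatchServer.html"/>
    <input>
      <!-- Input parameters, if any, are passed URL-encoded -->
      <http:urlEncoded/>
    </input>
    <output>
      <!-- Each match element names one piece of data and supplies the
           regular expression that extracts it from the returned HTML -->
      <text xmlns="http://microsoft.com/wsdl/mime/textMatching/">
        <match name="Title" pattern="TITLE&gt;(.*?)&lt;"/>
        <match name="H1" pattern="H1&gt;(.*?)&lt;"/>
      </text>
    </output>
  </operation>
</binding>
```

Because the patterns are stored in XML attributes, the angle brackets in each regular expression are written as the entities &amp;gt; and &amp;lt;. Once the XML is read, the first pattern is the regular expression TITLE>(.*?)<.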
The <match> Element
The match element can be specified with the following attributes:
Attribute | Description |
---|---|
name | The class or property name that represents the returned piece of data. If the match XML element has child match elements, a proxy class generated by the Wsdl.exe tool associates the name attribute with a class, and the child match elements are mapped to properties of that class. |
pattern | The regular expression pattern used to obtain the piece of data. For details regarding the .NET Framework regular expression syntax, see .NET Framework Regular Expressions. |
ignoreCase | Specifies whether the regular expression should be run case-insensitively. The default is case-sensitive. |
repeats | Specifies the number of values that should be returned from the regular expression when it has multiple matches on the HTML page. A value of 1 returns only the first match. A value of -1 returns all matches and equates to a * in a regular expression. The default value is -1. |
group | Specifies a grouping of related matches. |
capture | Specifies the index of a match within a grouping. |
type | Proxy classes generated using Wsdl.exe use the type attribute as the name of the returned class for a match that contains child match elements. By default, a proxy class generated by Wsdl.exe sets the name of the returned class to the name specified in the name attribute. |
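To make the name and type attributes more concrete, the fragment below sketches a match element that contains child match elements. Everything in it is hypothetical: the names Headline, HeadlineInfo, Link, and Caption, the patterns, and the spelling of the ignoreCase value as an XML boolean are assumptions made for illustration only.

```xml
<text xmlns="http://microsoft.com/wsdl/mime/textMatching/">
  <!-- Parent match: it has child match elements, so, per the table above,
       a proxy generated by Wsdl.exe would represent it as a class.
       The type attribute asks for that class to be named HeadlineInfo
       instead of defaulting to the value of name. -->
  <match name="Headline" type="HeadlineInfo"
         pattern="&lt;h2&gt;(.*?)&lt;/h2&gt;" ignoreCase="true">
    <!-- Child matches: each would map to a property of the class -->
    <match name="Link" pattern="href=&quot;(.*?)&quot;"/>
    <match name="Caption" pattern="&gt;([^&lt;]+)&lt;/a"/>
  </match>
</text>
```

Under these assumptions, a client of the generated proxy would read the parsed data through a HeadlineInfo object with Link and Caption members. If the type attribute were omitted, the class name would default to Headline.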
See Also
Tasks
How to: Create Web Services That Parse the Contents of a Web Page
Reference
MatchAttribute Class
Web Services Description Language Tool (Wsdl.exe)
Other Resources
.NET Framework Regular Expressions
XML Web Services Using ASP.NET