How to Identify Crawled Properties and Their Values

Summary

This page describes two steps of a search project that can prove challenging. The content relates specifically to SharePoint 2010 and FAST Search. The first step is to identify which crawled properties you have to work with, and the second step is to inspect the data stored in each crawled property. There is more than one way to do these tasks, and this article walks through three options. Crawled properties are created by content source connectors during a full crawl. Therefore, you must run a full crawl before using some of the options below.

Option 1 - FFD Dumper

This little-known utility proves extremely valuable in this situation. The output is somewhat cryptic, and you should not leave the utility turned on, because it writes to disk for every document that goes through the system. The FFD Dumper outputs all properties of a document as it travels through the FAST document processing pipeline. This includes crawled properties, which you can access in a pipeline extensibility stage and which you may map to managed properties. The names of these properties all begin with a GUID. Any other property you see in the file is used internally by the FAST system, and you do not have access to it or its contents. See the product notes below:
<!--     The FFDDumper feature is mainly used for support and debugging purposes.
    If activated all procserver processes in the system start populating the
    directory %FASTSEARCH%\data\ffd on the same processing nodes with so called
    FFD files.
    The feature causes a considerable I/O load on the feeding nodes and consumes
    disk space in the mentioned folder on the same machines.  -->
<processor name="FFDDumper" active="yes" />

You enable this utility by editing the file %FASTSEARCH%\etc\config_data\DocumentProcessor\optionalprocessing.xml and setting the attribute active="yes" on the FFDDumper processor element, as seen above. Then issue the following command so that the system reads the updated file: %FASTSEARCH%\bin\psctrl reset. Additional details on this XML file are available through the links under References.
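Because the dumper should not stay enabled, it can help to script the switch. The following is a minimal sketch, not a supported tool: set_ffd_dumper is a hypothetical helper, and it assumes that the disabled state is written as active="no", which the product notes above do not show. It edits the text with a regular expression rather than an XML parser so that the comments in the file are preserved.

```python
import re

# Matches the FFDDumper processor element shown in the product notes.
# Assumption: the only two values used for the attribute are "yes" and "no".
_FFD_TAG = re.compile(r'(<processor\s+name="FFDDumper"\s+active=")(?:yes|no)(")')


def set_ffd_dumper(xml_text, enabled):
    """Return xml_text with the FFDDumper active attribute set to yes or no."""
    new_value = "yes" if enabled else "no"
    updated, count = _FFD_TAG.subn(
        lambda m: m.group(1) + new_value + m.group(2), xml_text
    )
    if count != 1:
        # Refuse to guess if the element is missing or duplicated.
        raise ValueError("expected one FFDDumper element, found %d" % count)
    return updated


sample = '<processor name="FFDDumper" active="no" />'
print(set_ffd_dumper(sample, True))
```

To apply it, read optionalprocessing.xml into a string, write the returned text back, and then run psctrl reset as described above. Run it again with enabled set to False once you have collected the FFD files you need.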

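As mentioned, the output is a little cryptic, so here is a walk-through of some sample output. The first number on each line is the length of the property name, and the number after the s is the length of the value. For example, in the line 5 docid s10 ssic://151, the name docid is 5 characters long and the value ssic://151 is 10 characters long. You have access to all properties whose names start with a GUID.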

4 opid i 20
5 docid s10 ssic://151
10 collection s2 sp
2 op s3 ADD
A
42 49691C90-7E17-101A-A91C-08002B2ECDA9:#9:31 s36 file://gr06/xmlcontent2/SIM/doc1.xml
12 LastModified s27 2010-12-13T18:15:24,0096472
43 B725F130-47EF-101A-A5F1-02608C9EEBAC:#15:64 s27 2010-12-03T23:48:28,5016658
44 012357BD-1113-171D-1F25-292BB0B0B0B0:#325:20 s19 6897882022068960354
43 B725F130-47EF-101A-A5F1-02608C9EEBAC:#16:64 s27 2010-12-13T18:15:24,0076894
42 0B63E350-9CCC-11D0-BCDB-00805FCCCE04:#5:31 s8 text/xml
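The reading rules above can be sketched as a small filter that keeps only the records whose names start with a GUID. This is an illustration under stated assumptions, not a parser for the full FFD format: it assumes one record per line, names without spaces, and lengths that count characters (which holds for the ASCII sample above). The function names parse_ffd_line and crawled_properties are hypothetical.

```python
import re

# A crawled property name is assumed to start with a GUID followed by a colon.
GUID_PREFIX = re.compile(r"^[0-9A-Fa-f]{8}(?:-[0-9A-Fa-f]{4}){3}-[0-9A-Fa-f]{12}:")


def parse_ffd_line(line):
    """Return (name, type_code, value), or None if the line is not a record."""
    parts = line.rstrip("\r\n").split(" ", 3)
    if len(parts) != 4 or not parts[0].isdigit():
        return None
    name_len, name, type_token, value = parts
    # The leading number must equal the length of the property name.
    if len(name) != int(name_len):
        return None
    type_code, declared = type_token[:1], type_token[1:]
    # For string records (s followed by a number), check the value length too.
    if type_code == "s" and (not declared.isdigit() or len(value) != int(declared)):
        return None
    return name, type_code, value


def crawled_properties(lines):
    """Yield (name, value) for every record whose name starts with a GUID."""
    for line in lines:
        record = parse_ffd_line(line)
        if record and GUID_PREFIX.match(record[0]):
            yield record[0], record[2]


sample = [
    "4 opid i 20",
    "5 docid s10 ssic://151",
    "A",
    "42 0B63E350-9CCC-11D0-BCDB-00805FCCCE04:#5:31 s8 text/xml",
]
for name, value in crawled_properties(sample):
    print(name, "=", value)
```

Lines that do not fit the pattern, such as the lone A in the sample, are skipped rather than treated as errors, since the article does not describe what they mean.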

 

Option 2 - Use a TechNet PowerShell utility to view all crawled properties on the search results page

This option produces readable output that is easy to show to a business user. Read the View-AllCrawledProperties-PipelineExtensibility script documentation for additional details on this utility.

Option 3 - Develop a C# utility

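For good visibility into what a pipeline extensibility stage receives, add a call to the function below and pass it the path of the input file, which is the first command-line argument (args[0] in C#). The function copies the input file to the PipelineExtensibilityLog folder under %USERPROFILE%\AppData\LocalLow, using a file name built from the current time in the format yyyyMMddHHmmss.ffff with an .xml extension. See the TechNet and MSDN links under References for additional details.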
 

// Requires "using System;" and "using System.IO;" at the top of the file.
// Copy the input file to a location the application has write access to.
static void WriteOutInputFile(string inputFile)
{
  // Resolves to %USERPROFILE%\AppData\LocalLow
  string localLow = Path.Combine(
    Environment.GetEnvironmentVariable("USERPROFILE"), @"AppData\LocalLow");
  string pipelineInputData = Path.Combine(localLow, "PipelineExtensibilityLog");
  // Does nothing if the directory already exists.
  Directory.CreateDirectory(pipelineInputData);
  // Timestamped file name, for example 20101213181524.0096.xml
  string outFile = Path.Combine(pipelineInputData,
    DateTime.Now.ToString("yyyyMMddHHmmss.ffff") + ".xml");
  File.Copy(inputFile, outFile);
}
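Once a few input files have been copied, they can be summarized with a short script. The sketch below is an assumption-laden illustration rather than a description of the product: it assumes each file has a Document root containing CrawledProperty elements with propertySet, varType, and propertyName or propertyId attributes and no XML namespace. The sample document is made up for the example, and list_crawled_properties is a hypothetical name.

```python
import xml.etree.ElementTree as ET


def list_crawled_properties(root):
    """Return (propertySet, name_or_id, varType, value) for each crawled property."""
    rows = []
    for cp in root.iter("CrawledProperty"):
        # A property is assumed to be identified by a name or a numeric id.
        name = cp.get("propertyName") or cp.get("propertyId")
        rows.append((cp.get("propertySet"), name, cp.get("varType"), cp.text or ""))
    return rows


# Made-up input that follows the assumed layout described above.
sample = """<Document>
  <CrawledProperty propertySet="0B63E350-9CCC-11D0-BCDB-00805FCCCE04" varType="31" propertyId="5">text/xml</CrawledProperty>
  <CrawledProperty propertySet="11111111-2222-3333-4444-555555555555" varType="31" propertyName="example"></CrawledProperty>
</Document>"""

for row in list_crawled_properties(ET.fromstring(sample)):
    print(row)
```

For a file written by WriteOutInputFile, pass ET.parse(path).getroot() instead of parsing a string. If the real files turn out to use different element or attribute names, only the two string literals in the loop need to change.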

References

http://msdn.microsoft.com/library/ff795801.aspx
http://gallery.technet.microsoft.com/scriptcenter/en-us/834cd7a8-4e87-4b5a-bef9-a519fd1712ba
http://msdn.microsoft.com/en-us/library/ff795826.aspx

Contributors:  Barry Waldbaum, Brent Groom