The Indexer
At its core, the Windows Search indexer doesn’t really know anything about files, e-mails, or anything like that. In fact, all it really knows is how to do the following things:
The indexer relies on other Windows Search components to handle the specifics, such as converting a URL into data to be indexed. That’s where Protocol Handlers, IFilters, and Property Handlers come in.
Protocol Handlers
A Protocol Handler allows the indexer to crawl a specific kind of data store. For example, the File System Protocol Handler allows the indexer to crawl files stored on your hard drive. Windows Search ships with several Protocol Handlers, including those for the File System, MAPI (i.e., Outlook), and the Client-Side Cache for Offline Files (Vista only). Other examples include the Protocol Handlers for Lotus Notes, the IE History / Cache, and Mozilla Thunderbird.
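To picture the "one handler per kind of store" idea, here is a toy Python model that picks a handler based on the scheme at the front of a URL. Everything in it (the `HANDLERS` table, `register`, `dispatch`) is a made-up name for illustration. It is not how Windows Search actually registers or invokes Protocol Handlers, just a sketch of the concept.

```python
from typing import Callable, Dict

# Hypothetical registry mapping a URL scheme to a handler function.
# These names are illustrative only, not real Windows Search APIs.
HANDLERS: Dict[str, Callable[[str], str]] = {}


def register(scheme: str, handler: Callable[[str], str]) -> None:
    """Associate a URL scheme (e.g. 'file') with a handler."""
    HANDLERS[scheme.lower()] = handler


def dispatch(url: str) -> str:
    """Hand the URL to whichever handler claims its scheme."""
    scheme, sep, _rest = url.partition("://")
    if not sep:
        raise ValueError(f"not a URL: {url!r}")
    try:
        handler = HANDLERS[scheme.lower()]
    except KeyError:
        raise LookupError(f"no protocol handler for scheme {scheme!r}") from None
    return handler(url)


# Two stand-in handlers that only describe what they would do.
register("file", lambda url: f"file-system handler crawls {url}")
register("mapi", lambda url: f"MAPI handler crawls {url}")
```

Calling `dispatch("file://C:/Foo/")` routes to the file-system stand-in, while a URL with an unregistered scheme raises `LookupError`. The point is only that the indexer itself never needs to know what a "file" or a "mailbox" is.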
At a basic level, a Protocol Handler is just a piece of code that takes as input a URL (like “file://C:/Foo/” or “mapi://{USER-SID}/Brandon’s Mailbox/Inbox”) and performs two important tasks:
IFilters
An IFilter is responsible for taking an item such as a file (usually in the form of a Stream) and emitting the contents and properties of that item for indexing.
For example, the MS Word IFilter knows how to take the stream from a .DOC or .DOCX file and emit both the contents and useful properties (like the author’s name or the date it was last modified) for the index.
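A real IFilter is a COM component, so the Python below is only a rough model of the "stream in, contents and properties out" shape described above. The file format it parses (a few `Key: value` header lines, a blank line, then body text) is invented for this example, as are the names `FilterOutput` and `toy_filter`.

```python
import io
from dataclasses import dataclass, field
from typing import BinaryIO, Dict, List


@dataclass
class FilterOutput:
    """What a filter hands back: body text plus named properties."""
    text_chunks: List[str] = field(default_factory=list)
    properties: Dict[str, str] = field(default_factory=dict)


def toy_filter(stream: BinaryIO) -> FilterOutput:
    """Parse a made-up format: 'Key: value' headers, blank line, body."""
    out = FilterOutput()
    in_header = True
    for line in stream.read().decode("utf-8").splitlines():
        if in_header:
            if not line.strip():
                in_header = False      # blank line ends the header block
                continue
            key, sep, value = line.partition(":")
            if sep:
                out.properties[key.strip()] = value.strip()
                continue
            in_header = False          # no colon: treat as start of body
        if line.strip():
            out.text_chunks.append(line.strip())
    return out


# Sample data, invented for the example.
doc = io.BytesIO(b"Author: Brandon\nTitle: Notes\n\nHello indexer.\nSecond line.\n")
result = toy_filter(doc)
```

Here `result.properties` holds the author and title, and `result.text_chunks` holds the two body lines. The filter only needs a stream, not a file path, which is what lets the same filter work on a file on disk or an e-mail attachment.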
Property Handlers
Property Handlers are similar to IFilters, except that they’re designed to simply return properties for items and not complex textual content.
The three processes used by the Windows Search service are SearchIndexer.exe, SearchProtocolHost.exe, and SearchFilterHost.exe. Sometimes you may even see multiple instances of the latter two running simultaneously (especially if multiple users are logged in).
So why are they divided up in this way? To find out, let’s look at what each of the processes does.
SearchIndexer.exe runs as a system service under the SYSTEM account. It is responsible for maintaining the index, servicing queries, and deciding what to crawl and when.
SearchProtocolHost.exe sometimes runs under the SYSTEM account, and other times runs in the context of the current user. It hosts a Protocol Handler responsible for enumerating items in a specific store (such as the File System, Outlook, UNC shares, Lotus Notes, etc.).
Why is it separate?
Access - Sometimes it needs to run in the context of the SYSTEM account (e.g., to index the filesystem even when a user is not logged in). Other times it needs to run in the context of the user, so that it can access data that is ACL’d for that user (network shares, Offline Files) or accessed via a program the user is running (Outlook, Thunderbird).
Reliability - If a protocol handler, which may be written by a third party, crashes, it will not crash the indexer itself. This reduces the risk of index corruption, and ensures that you can still issue queries even if a protocol handler crashes or hangs.
Security - Isolating code that interacts with possibly untrusted data stores can mitigate vulnerabilities in said code.
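The reliability argument is easy to demonstrate in miniature. The sketch below is not how the Windows Search service hosts its handlers. It simply runs a piece of stand-in "handler" code in a child Python process, so that a crash or hang in the child is reported to the parent instead of taking it down. The function name `run_isolated` and the three outcome strings are invented for this example.

```python
import subprocess
import sys


def run_isolated(code: str, timeout: float = 10.0) -> str:
    """Run `code` in a child Python process and report how it ended."""
    try:
        result = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising on timeout.
        return "hung"
    if result.returncode != 0:
        return "crashed"
    return result.stdout.strip()


# The parent keeps running no matter what the child does.
good = run_isolated("print('indexed 3 items')")
bad = run_isolated("import os; os._exit(3)")
stuck = run_isolated("import time; time.sleep(30)", timeout=0.5)
```

After these three calls the parent is still alive and knows that one child succeeded, one died, and one had to be killed. That is the property you want from an indexer that has to keep answering queries while third-party code misbehaves.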
SearchFilterHost.exe hosts the actual IFilters. These filters are responsible for processing individual items, such as files, in a data store.
Why is it separate?
Security - This process is tightly locked down. For example, it cannot even read the filesystem. It runs with reduced privileges (kind of like Protected Mode IE). Why is this important? Well, think back to the WMF file vulnerability a year or so ago. Google Desktop Search would trigger the vulnerability whenever it indexed one of those files. If you received one as an e-mail attachment, you would have a 0-click attack, because Google Desktop Search doesn’t sandbox the indexing process. This wasn’t a problem for WDS users because we have always isolated filtering to a separate locked-down process.
Reliability - Same as with the Protocol Handlers. IFilters are very often third-party code, and may be subjected to corrupted files. Keeping them separated improves robustness to crashes / hangs in third-party code or when dealing with corrupted data.
Hi. I'm Brandon. I'm a geek, and I work on Search technology for Windows at Microsoft. This is my blog.
The views expressed within my blog are my own, and are not in any way indicative of those of the company I work for, Microsoft, or its employees. No warranties or other guarantees will be offered as to the quality of the opinions or anything else offered here.