FAQ: How does indexing work? What are IFilters and Protocol Handlers?

by Brandon on June 20th, 2007

The Indexer 

At its core, the Windows Search indexer doesn’t really know anything about files, e-mails, or anything like that.  In fact, all it really knows is how to do the following things:

  1. Index the contents and metadata associated with a URL and store them in a row.
  2. Retrieve rows that match a specific query.
  3. Shape the results in interesting ways (sorted, grouped, etc.)
  4. Retrieve properties / metadata associated with a row.
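
As a rough illustration of those four jobs, here is a minimal Python sketch. It is not how the real indexer is built; `ToyIndex`, `Row`, and every method name are made up for this example, and matching is a naive whole-word scan rather than a real inverted index.

```python
from dataclasses import dataclass, field


@dataclass
class Row:
    """One indexed item: a URL plus its text content and properties."""
    url: str
    content: str
    properties: dict = field(default_factory=dict)


class ToyIndex:
    """Hypothetical stand-in for the indexer; knows only URLs and rows."""

    def __init__(self):
        self.rows = {}

    def index(self, url, content, properties):
        # 1. Store contents and metadata for a URL in a row.
        self.rows[url] = Row(url, content, dict(properties))

    def query(self, word):
        # 2. Retrieve rows whose content contains the given word.
        word = word.lower()
        return [r for r in self.rows.values() if word in r.content.lower().split()]

    def shape(self, rows, sort_by):
        # 3. Shape results, here by sorting on a property.
        return sorted(rows, key=lambda r: r.properties.get(sort_by, ""))

    def properties(self, url):
        # 4. Retrieve the properties stored with a row.
        return self.rows[url].properties


idx = ToyIndex()
idx.index("file://C:/Foo/Bar/Taxes.docx", "tax return draft", {"author": "Brandon"})
idx.index("file://C:/Foo/Notes.txt", "tax notes", {"author": "Alex"})
hits = idx.shape(idx.query("tax"), sort_by="author")
hit_urls = [r.url for r in hits]
```

Note that nothing in the sketch knows what a file or an e-mail is; it only sees URLs, text, and properties, which is the point the list above makes.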

The indexer relies on other Windows Search components to handle the specifics, such as converting a URL into data to be indexed.  That’s where Protocol Handlers, IFilters, and Property Handlers come in.

Protocol Handlers

A Protocol Handler allows the indexer to crawl a specific kind of data store.  For example, the File System Protocol Handler allows the indexer to crawl files stored on your hard drive.  Windows Search ships with a few Protocol Handlers, including those for the File System, MAPI (i.e., Outlook), and the Client-Side Cache for Offline Files (Vista only).  Other examples include the Protocol Handlers for Lotus Notes, the IE History / Cache, and Mozilla Thunderbird.

At a basic level, a Protocol Handler is just a piece of code that takes as input a URL (like “file://C:/Foo/” or “mapi://{USER-SID}/Brandon’s Mailbox/Inbox”) and performs two important tasks:

  1. Enumeration of child URLs (such as “file://C:/Foo/Bar/” or “file://C:/Foo/Bar/Taxes.docx”)
  2. Binding of URLs to either an IFilter, or a Stream (which can be bound to whatever IFilter is registered for its content type)
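
The two tasks can be sketched in Python as follows. This is only a toy model under stated assumptions: the `toy://` scheme, the dict-backed store, and the names `ToyProtocolHandler`, `enumerate_children`, `bind_to_stream`, and `crawl` are all invented for illustration and do not correspond to the real COM interfaces.

```python
import io


class ToyProtocolHandler:
    """Hypothetical handler for a made-up 'toy://' store held in a dict.

    The store maps a URL either to a list of child URLs (a container)
    or to bytes (an item with content).
    """

    def __init__(self, store):
        self.store = store

    def is_container(self, url):
        return isinstance(self.store[url], list)

    def enumerate_children(self, url):
        # Task 1: enumerate the child URLs of a container URL.
        return list(self.store[url])

    def bind_to_stream(self, url):
        # Task 2: bind an item URL to a stream that a filter can read.
        return io.BytesIO(self.store[url])


def crawl(handler, root):
    """Walk the store from root, yielding (url, stream) for each item."""
    pending = [root]
    while pending:
        url = pending.pop()
        if handler.is_container(url):
            pending.extend(handler.enumerate_children(url))
        else:
            yield url, handler.bind_to_stream(url)


store = {
    "toy://Foo/": ["toy://Foo/Bar/"],
    "toy://Foo/Bar/": ["toy://Foo/Bar/Taxes.txt"],
    "toy://Foo/Bar/Taxes.txt": b"tax return draft",
}
crawled = {url: stream.read() for url, stream in crawl(ToyProtocolHandler(store), "toy://Foo/")}
```

The crawl loop never looks inside the store itself; it only asks the handler for children and streams, which is what lets the same loop work over any store that has a handler.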

IFilters

An IFilter is responsible for taking an item such as a file (usually in the form of a Stream) and emitting the contents and properties of that item for indexing.

For example, the MS Word IFilter knows how to take the stream from a .DOC or .DOCX file and emit both the contents and useful properties (like the author’s name or the date it was last modified) to the index.
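
To make the idea concrete, here is a small Python sketch of a filter for an invented text format (a `Name: value` header block, a blank line, then paragraphs). The format and the `toy_filter` function are assumptions made for this example. The real IFilter COM interface also hands content back in chunks, via GetChunk, GetText, and GetValue, but this sketch does not try to mirror it.

```python
import io


def toy_filter(stream):
    """Yield ('property', name, value) and ('text', None, text) chunks.

    Assumes a made-up format: header lines of the form 'Name: value',
    then a blank line, then body paragraphs separated by blank lines.
    """
    text = stream.read().decode("utf-8")
    header, _, body = text.partition("\n\n")
    for line in header.splitlines():
        name, sep, value = line.partition(":")
        if sep:
            # Properties such as the author or the modified date.
            yield ("property", name.strip(), value.strip())
    for paragraph in body.split("\n\n"):
        if paragraph.strip():
            # Textual content to be indexed.
            yield ("text", None, paragraph.strip())


sample = io.BytesIO(
    b"Author: Brandon\nModified: 2007-06-20\n\nFirst paragraph.\n\nSecond paragraph."
)
chunks = list(toy_filter(sample))
```

A Property Handler, in these terms, would be a filter that yields only the `property` chunks.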

Property Handlers

Property Handlers are similar to IFilters, except that they’re designed to simply return properties for items and not complex textual content.


14 Comments
  1. marcovanschagen

    Brandon,
    Great article!

    I have been working to release DWG IFilter 2007, to enable users to search within AutoCAD DWG files. I am collecting general support information and information on IFilters on the support site http://www.dwgifilter.com . I am trying to post general specs, registry settings, and all the how-and-why info around IFilters that may be helpful to users.
    Also, at that location you can download a free trial.

    Marco

  2. Sorry to post this here, but if you are on the WDS team you need to know that

    #1) Outlook 2007 bugs me to install WDS
    #2) Install WDS and it DOES NOT INDEX MY EMAIL

    My .PST file is not in the default place. I moved it 12+ years ago and have kept it there ever since. In my case, d:\work\email\mymail.pst. I also have an archieve.pst in the same folder which is accessible from Outlook. WDS didn’t index that either.

    It’s very frustrating to be constantly badgered by Microsoft Outlook 2007 to install WDS to index my email and then have it not actually work.

  3. John Villere

    My Vista search results include many files that have been deleted or renamed. So, when I attempt to archive those files to data DVDs, I get errors. Any comments?

  4. suc

    Sometimes when you search for a file in the Start menu search, you get a non-existent file, and when you click on it you get a message similar to this: “searched file not found”. Please add a message like this: “file not found, do you want to remove it from the index? yes/no”.
    A “do you want to remove it? yes/no” message is already present for Start menu objects, but not for Windows Search results. Please add it.

  5. Sheldon

    WDS does not provide a formatted docx or xlsx preview. Is there a workaround for this or another solution? Thanks.

  6. Shafaat Karim

    Hi Brandon,

    I am trying to build a search for a website hosted on Windows Server 2008.
    Could you provide some tips on how to build it?

    Currently I am using Indexing Service to perform the search by creating a catalog.

    Any help is appreciated.

    Regards

  7. You list the first important responsibility of a protocol handler as “Enumeration of child URLs” but it’s not very clear from the documentation how we are supposed to support that.

    I’ve been trying to find out how the mapi: and oneindex: (OneNote) protocols do this. One approach would be to implement IShellFolder, and they do appear to implement this. But if you call EnumObjects, they seem to return E_NOTIMPL. The other way it occurred to me one could enumerate objects for crawling is to have an IFilter return a series of chunks of a PKEY_Search_UrlToIndexWithModificationTime property. But when I ask the OneNote search root for all its chunks, it appears to return just one: a property called ‘ROBOTS’ with a value of ‘NOINDEX’.

    Perhaps that’s because the OneNote protocol doesn’t in fact enumerate its contents at all? From the docs it looks like you don’t technically need to make a custom store crawlable, as long as you’re prepared to explicitly notify WDS of every single indexable item you add.

    But in any case, it’s quite hard to work out what our options are for making a custom store crawlable. Given that this is one of only two jobs for a protocol handler, I think it’d be helpful to make the documentation a little more clear here. (And in particular, it would be really useful to have more insight into what WDS actually does. I want to write tests for my code, but if it’s not clear what WDS requires of my code, it’s hard to know what I’m supposed to test for.)

  8. Mike

    I am trying to find a way to bypass the restriction that Indexing Service doesn’t index HTML files which contain the Robots NOINDEX meta tag.

    Since this is a mirrored copy of our production website, I’d like to fully be able to search every page on the local machine copy while preventing major web search engines from indexing the pages.

    Any help would be greatly appreciated.

  9. Techie

    I am using WDS to search public folders located in Public Folders / Favorites.
    The problem is that, if I abbreviate my search criteria to include only part of the term I’m searching for, I must use characters near the beginning of the term. If I use characters from the last half of the term, WDS does not find a match.

    Below, I have tried to illustrate the results I get from searching on various combinations of the characters contained in the term 07BS12345

    SEARCH CRITERIA    RESULTS
    07BS12345          search results shown
    07BS               search results shown
    07                 search results shown
    12345              NO RESULTS

    I don’t understand why WDS can’t seem to search on the characters in the last half of the term, only the front. I need complete search results whether the criteria are taken from the first half or the last half of the term. I have conducted this same test using Outlook Advanced Find (which was my search tool of choice until I came across WDS). Outlook Advanced Find works great and does not care which portion of the term I use as search criteria.

    I need to know that I’m getting complete search results even if I fail to use the entire term in the criteria. I really appreciate the help because I want WDS to work for me. Let me know if there are any other details that you would like to know.

  10. Hi there,

    A couple things to understand:
    Windows Search is designed for word-based search, not character-based matching. Character-based indexing may make sense in some very advanced user cases, but is wholly wrong for the scenarios Windows Search is trying to solve.

    If the user types “he” and we return every search result with the word “the” – the results will be very irrelevant. Such matching would drive users mad with the number of search results that come back for no apparent reason.

    The only wildcard support that exists for document content is prefix matching. Windows Search entry points, including Outlook 2007, default to prefix matching. That is, a query for foo is really a query for foo* – which means words like foobar will be returned, but barfoo will not.

    For properties (like the title, filename, author, tags, etc.), there are many more wildcard options supported. For example, you can specify a suffix-based query by typing *foo into the search box (which really becomes *foo*). However, the index currently does not support these for document content. So while *foo will find a document named Barfoo.docx, it will not find a file named Bar.docx that contains the word “barfoo.”
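
The difference between the two matching modes can be sketched in a few lines of Python. This is only an illustration of the behaviour described above, not the actual query engine; `content_match` and `property_match` are hypothetical names, and real queries involve word breaking and much more.

```python
def content_match(query, document_words):
    """Content queries are prefix matches: 'foo' behaves like 'foo*'."""
    q = query.lower()
    return any(word.lower().startswith(q) for word in document_words)


def property_match(pattern, value):
    """Property queries: a leading '*' makes the term a substring match,
    so '*foo' behaves like '*foo*'. Otherwise fall back to a prefix match."""
    p = pattern.lower()
    v = value.lower()
    if p.startswith("*"):
        return p.strip("*") in v
    return v.startswith(p)


# Content: the start of the indexed word matches, its tail does not.
front_hit = content_match("07BS", ["07BS12345"])
tail_hit = content_match("12345", ["07BS12345"])

# Properties: a suffix query finds the file name.
name_hit = property_match("*foo", "Barfoo.docx")
```

Under this model `front_hit` is true and `tail_hit` is false, which lines up with the 07BS12345 results reported in the comment above.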

    Hope that helps,
    Brandon

  11. Alex Grigoriev

    Here are my two bits:

    1. Installed WDS 4 on XP. Immediately noticed that the system becomes extremely slow at times (games become jerky), starts thrashing the hard drive, and the indexer CPU usage goes up to 50%. Shouldn’t indexing be I/O bound, rather than CPU bound? I have a single CPU, just like so many people around.

    2. Tried to search for some string in a folder. Got a helpful response: “this directory is not indexed yet”. Should a user give a damn about that? Just do a full file search without such stupid excuses. You can also gather indexing information during that; but JUST DO THE SEARCH ANYWAY. Same as if a user wants to search something which is never indexed. What if a user wants to get results for a just-updated file, which is not reindexed yet? Will you return stale results?

    Given that, I just uninstalled WDS 4. Not ready for prime time, sorry.

  12. Alex Grigoriev

    Brandon,

    Does the filter host actually run under Local System (SYSTEM), as the documentation states? Doing that would be an extremely stupid idea, even though the process runs under a restricted token. This is what the Local Service account is for.
