Following on from my More on SQL Server 2005 Full Text Index Service post the other day, I thought I'd give an example of how it works

Setup


I created a table LanguageData which consisted of 2 fields liID and sValue

CREATE TABLE [dbo].[LanguageData]
(
  [ID] [int] IDENTITY(1,1) NOT NULL,
  [Value] [nvarchar](50) NOT NULL,
  CONSTRAINT [PK_LanguageData] PRIMARY KEY CLUSTERED
  (
  [ID] ASC
  ) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]

I entered some sample data as follows

INSERT INTO [LanguageData](Value)
SELECT 'the' UNION
SELECT 'przed' UNION
SELECT 'jakby'

where 'the' is featured in the English and Neutral language noise word files, 'przed' and 'jakby' are in the Polish language noise files. Note: You'll need to have installed the Polish full text index to make this work.

Next enable the full text indexing on the database

sp_fulltext_database 'enable'

and then create a full text catalog and an index for the table LanguageData

CREATE FULLTEXT CATALOG LanguageData AS DEFAULT
CREATE FULLTEXT INDEX ON LanguageData ([Value] LANGUAGE 1045 )
KEY INDEX [PK_LanguageData]

where 1045 indicates the language Polish - retrieved from
SELECT alias, lcid FROM Sys.syslanguages
WHERE alias = 'Polish'

Scenarios


Now, time to run some tests,

1) Check that all is initially correct, get everything

SELECT * FROM LanguageData

which returns 3 rows, as expected

2) Get everything which matches the noise word 'jakby'

SELECT * FROM LanguageData
WHERE CONTAINS(*,'jakby')

returns no rows as the word 'jakby' was stripped out at index time, and is also stripped out at query time, and a warning message "Informational: The full-text search condition contained noise word(s)."

3) Get everything which matches the noise word 'jakby' specifying Polish (1045) in the CONTAINS clause

SELECT * FROM LanguageData
WHERE CONTAINS(*, 'jakby', language 1045 )

returns no rows as the word 'jakby' was stripped out at index time, and is also stripped out at query time, and a warning message "Informational: The full-text search condition contained noise word(s)."

4) Get everything which matches the word 'jakby' specifying US English (1033) in the CONTAINS clause

SELECT * FROM LanguageData
WHERE CONTAINS(*, 'jakby', language 1033 )

returns no rows as the word 'jakby' was stripped out at index time. No warning message is displayed though as 'jakby' is not a noise word for US English

5) Get everything which matches the word 'the'

SELECT * FROM LanguageData
WHERE CONTAINS(*, 'the')

returns one row, as 'the' isn't a noise word in Polish and so wasn't stripped out at index time or at query time

6) Get everything which matches the word 'the' specifying Polish in the CONTAINS clause

SELECT * FROM LanguageData
WHERE CONTAINS(*, 'the', language 1045 )

returns one row, as 'the' isn't a noise word in Polish and so wasn't stripped out at index time or at query time

7) Get everything which matches the word 'the' specifying US English in the CONTAINS clause

SELECT * FROM LanguageData
WHERE CONTAINS(*, 'the', language 1033 )

returns no rows as 'the' is a noise word in US English and therefore is excluded at query time. A warning message "Informational: The full-text search condition contained noise word(s)." is displayed

Now to make it more interesting, lets add some data which combines noise words with normal words

INSERT INTO [LanguageData] (Value)
VALUES
('jakby przed the test')

which includes 2 polish noise words, one english noise word and one remaining word

8) Get everything which matches the word 'jakby'

SELECT * FROM LanguageData
WHERE CONTAINS(*, 'jakby')

returns no rows as the word 'jakby' was stripped out at index time, and is also stripped out at query time, and a warning message "Informational: The full-text search condition contained noise word(s)." is displayed

9) Get everything which matches the word 'the'

SELECT * FROM LanguageData
WHERE CONTAINS(*, 'the')

returns 2 rows, both the individual 'the' entry and the new 'jakby przed the test' rows. No message is displayed.

10) Get everything which matches the word 'the' using an explicit query language of Polish

SELECT * FROM LanguageData
WHERE CONTAINS(*, 'the', language 1045)

returns 2 rows, both the individual 'the' entry and the new 'jakby przed the test' rows. No message is displayed.

11) Get everything which matches the word 'the' using an explicit query language of English

SELECT * FROM LanguageData
WHERE CONTAINS(*, 'the', language 1033 )

returns no rows as the word 'the' was stripped at query time according to the noise words for 1033. A warning message "Informational: The full-text search condition contained noise word(s)." is displayed

And then to make it even more interesting, lets add a new word 'jane' to the LanguageData dataset, and to the noisewords file for the Neutral language (LCID 0) which (on my machine at least) is at C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\FTData\noiseNEU.txt

To get the full text indexing service to pick up the changes to the noise files, you need to restart the service via the Control Panel -> Administrative Tools -> Service dialog

INSERT INTO LanguageData (Value)
VALUES ('jane')

12) Get everything which matches the word 'jane' using the implicit query language (Polish)

SELECT * FROM LanguageData
WHERE CONTAINS(*, 'jane')

which returns 1 row, as 'jane' isn't a polish noise word and wasn't stripped out at either index or query time

13) Get everything which matches the word 'jane' using the explicit query language English

SELECT * FROM LanguageData
WHERE CONTAINS(*, 'jane', language 1033 )

which returns 1 row, as 'jane' isn't a polish noise word and so wasn't stripped out an index time, neither is it an english noise word so isn't stripped out at query time either

14) Get everything which matches the word 'jane' using the explicit query language Neutral

SELECT * FROM LanguageData
WHERE CONTAINS(*, 'jane', language 0 )

which returns 0 rows as 'jane' is a neutral noise word and so is stripped out at index time. A warning message "Informational: The full-text search condition contained noise word(s)." is displayed

Summary


What this shows, is that when you choose a language to set your full text index up as, this impacts the words which will be stripped out of the index as anything defined as noise will be removed. This has an impact on the choice of language when different language content is being indexed as we need to be clear that what is one languages noise word, isn't another ones non-noise word.
  • When querying a full text index, it is possible to specify that the query you are running is for a particular language, but if you do and if the language is different to that you set the index up as, then you'll remove 2 sets of noise words from your search - both those that were set up when the index was defined, but also those based on the language specified in the query
  • The noise files are defined on an instance by instance basis and so any alterations to the noise file will affect all full text indexes on an instance.
  • To pick up changes to the noise files, the service needs to be restarted.
  • SQL Server 2008 seems to change this and so more research will be required - it relies on STOPLISTs instead.