Tuesday, August 26, 2008
Localisation vs Internationalisation
A great car manufacturing analogy from Beyond Borders: web globalization strategies has really helped me get the difference between Localisation and Internationalisation clear in my head:
When a car is designed and built, it is designed to be as appropriate for international markets as possible, so items like the steering wheel can be positioned at the right or left hand side of the car. So, Internationalisation is the behind the scenes work to make the item configurable and customisable as appropriate for a specific location.
When you buy a car, you generally buy a car that has been localised for your country - so here in the UK I would buy a right hand drive car, whilst my cousins in the US would buy a left hand drive car.
Labels: localisation, localization
// posted by Jane @ 3:09 PM
Monday, July 28, 2008
Full Text Indexing - the impact of index time and query time language choice
Following on from my More on SQL Server 2005 Full Text Index Service post the other day, I thought I'd give an example of how it works.
Setup
I created a table LanguageData which consists of 2 columns, ID and Value
CREATE TABLE [dbo].[LanguageData]
(
[ID] [int] IDENTITY(1,1) NOT NULL,
[Value] [nvarchar](50) NOT NULL,
CONSTRAINT [PK_LanguageData] PRIMARY KEY CLUSTERED
(
[ID] ASC
) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
I entered some sample data as follows
INSERT INTO [LanguageData](Value)
SELECT 'the' UNION
SELECT 'przed' UNION
SELECT 'jakby'
where 'the' is featured in the English and Neutral language noise word files, and 'przed' and 'jakby' are in the Polish language noise file. Note: You'll need to have installed the Polish full text index to make this work.
Next, enable full text indexing on the database
sp_fulltext_database 'enable'
and then create a full text catalog and an index for the table LanguageData
CREATE FULLTEXT CATALOG LanguageData AS DEFAULT
CREATE FULLTEXT INDEX ON LanguageData ([Value] LANGUAGE 1045 )
KEY INDEX [PK_LanguageData]
where 1045 indicates the language Polish - retrieved from
SELECT alias, lcid FROM sys.syslanguages
WHERE alias = 'Polish'
Scenarios
Now, time to run some tests.
1) Check that all is initially correct, get everything
SELECT * FROM LanguageData
which returns 3 rows, as expected
2) Get everything which matches the noise word 'jakby'
SELECT * FROM LanguageData
WHERE CONTAINS(*,'jakby')
returns no rows, as the word 'jakby' was stripped out at index time and is also stripped out at query time. A warning message "Informational: The full-text search condition contained noise word(s)." is displayed
3) Get everything which matches the noise word 'jakby' specifying Polish (1045) in the CONTAINS clause
SELECT * FROM LanguageData
WHERE CONTAINS(*, 'jakby', language 1045 )
returns no rows, as the word 'jakby' was stripped out at index time and is also stripped out at query time. A warning message "Informational: The full-text search condition contained noise word(s)." is displayed
4) Get everything which matches the word 'jakby' specifying US English (1033) in the CONTAINS clause
SELECT * FROM LanguageData
WHERE CONTAINS(*, 'jakby', language 1033 )
returns no rows, as the word 'jakby' was stripped out at index time. No warning message is displayed though, as 'jakby' is not a noise word for US English
5) Get everything which matches the word 'the'
SELECT * FROM LanguageData
WHERE CONTAINS(*, 'the')
returns one row, as 'the' isn't a noise word in Polish and so wasn't stripped out at index time or at query time
6) Get everything which matches the word 'the' specifying Polish in the CONTAINS clause
SELECT * FROM LanguageData
WHERE CONTAINS(*, 'the', language 1045 )
returns one row, as 'the' isn't a noise word in Polish and so wasn't stripped out at index time or at query time
7) Get everything which matches the word 'the' specifying US English in the CONTAINS clause
SELECT * FROM LanguageData
WHERE CONTAINS(*, 'the', language 1033 )
returns no rows, as 'the' is a noise word in US English and therefore is excluded at query time. A warning message "Informational: The full-text search condition contained noise word(s)." is displayed
Now to make it more interesting, let's add some data which combines noise words with normal words
INSERT INTO [LanguageData] (Value)
VALUES
('jakby przed the test')
which includes 2 Polish noise words, one English noise word and one remaining word
8) Get everything which matches the word 'jakby'
SELECT * FROM LanguageData
WHERE CONTAINS(*, 'jakby')
returns no rows, as the word 'jakby' was stripped out at index time and is also stripped out at query time. A warning message "Informational: The full-text search condition contained noise word(s)." is displayed
9) Get everything which matches the word 'the'
SELECT * FROM LanguageData
WHERE CONTAINS(*, 'the')
returns 2 rows, both the individual 'the' entry and the new 'jakby przed the test' row. No message is displayed.
10) Get everything which matches the word 'the' using an explicit query language of Polish
SELECT * FROM LanguageData
WHERE CONTAINS(*, 'the', language 1045)
returns 2 rows, both the individual 'the' entry and the new 'jakby przed the test' row. No message is displayed.
11) Get everything which matches the word 'the' using an explicit query language of English
SELECT * FROM LanguageData
WHERE CONTAINS(*, 'the', language 1033 )
returns no rows, as the word 'the' was stripped at query time according to the noise words for 1033. A warning message "Informational: The full-text search condition contained noise word(s)." is displayed
And then to make it even more interesting, let's add a new word 'jane' to the LanguageData dataset, and to the noise words file for the Neutral language (LCID 0), which (on my machine at least) is at C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\FTData\noiseNEU.txt
To get the full text indexing service to pick up the changes to the noise files, you need to restart the service via the Control Panel -> Administrative Tools -> Services dialog
INSERT INTO LanguageData (Value)
VALUES ('jane')
12) Get everything which matches the word 'jane' using the implicit query language (Polish)
SELECT * FROM LanguageData
WHERE CONTAINS(*, 'jane')
which returns 1 row, as 'jane' isn't a Polish noise word and wasn't stripped out at either index or query time
13) Get everything which matches the word 'jane' using the explicit query language English
SELECT * FROM LanguageData
WHERE CONTAINS(*, 'jane', language 1033 )
which returns 1 row, as 'jane' isn't a Polish noise word and so wasn't stripped out at index time; neither is it an English noise word, so it isn't stripped out at query time either
14) Get everything which matches the word 'jane' using the explicit query language Neutral
SELECT * FROM LanguageData
WHERE CONTAINS(*, 'jane', language 0 )
which returns 0 rows, as 'jane' is a Neutral noise word and so is stripped out at query time (the index itself is Polish, so 'jane' is still in the index). A warning message "Informational: The full-text search condition contained noise word(s)." is displayed
Summary
What this shows is that:
- The language you choose when setting up a full text index determines which words are stripped out of the index, as anything defined as noise for that language is removed. This matters when content in different languages is being indexed, as a noise word in one language may be a meaningful word in another.
- When querying a full text index, it is possible to specify that the query you are running is for a particular language, but if that language is different to the one the index was set up with, then you'll remove 2 sets of noise words from your search: those stripped when the index was built, and those for the language specified in the query.
- The noise files are defined on an instance by instance basis and so any alterations to the noise file will affect all full text indexes on an instance.
- To pick up changes to the noise files, the service needs to be restarted.
- SQL Server 2008 seems to change this and so more research will be required - it relies on STOPLISTs instead.
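The two-stage stripping seen in the scenarios can be modelled with a small toy sketch in JavaScript. This is only an illustration of the behaviour observed above, not how SQL Server implements full text search. The noise lists are cut down to the handful of words used in this post, and the names NOISE, buildIndex and contains are made up for the example.

```javascript
// Toy model: noise words are removed once when the index is built (index
// language) and again when the query is parsed (query language).
// Assumption: the noise lists hold only the words used in this post.
const NOISE = {
  1045: new Set(['przed', 'jakby']), // Polish
  1033: new Set(['the']),            // US English
  0: new Set(['the', 'jane']),       // Neutral, after 'jane' was added
};

// Build a word -> row numbers map, skipping the index language's noise words.
function buildIndex(rows, indexLcid) {
  const index = new Map();
  rows.forEach((value, rowNumber) => {
    for (const word of value.toLowerCase().split(/\s+/)) {
      if (NOISE[indexLcid].has(word)) continue; // stripped at index time
      if (!index.has(word)) index.set(word, new Set());
      index.get(word).add(rowNumber);
    }
  });
  return index;
}

// Look up a single word using the noise list of the query language.
function contains(index, rows, term, queryLcid) {
  const word = term.toLowerCase();
  if (NOISE[queryLcid].has(word)) return []; // stripped at query time
  return [...(index.get(word) || [])].map((rowNumber) => rows[rowNumber]);
}

const rows = ['the', 'przed', 'jakby', 'jakby przed the test', 'jane'];
const index = buildIndex(rows, 1045); // indexed as Polish

console.log(contains(index, rows, 'the', 1045));   // scenario 10: two rows
console.log(contains(index, rows, 'the', 1033));   // scenario 11: no rows
console.log(contains(index, rows, 'jakby', 1033)); // scenario 4: no rows
console.log(contains(index, rows, 'jane', 0));     // scenario 14: no rows
```

The model makes the asymmetry easy to see: 'jakby' is missing from the index whatever language the query uses, whereas 'the' and 'jane' are in the index and only disappear when the query language treats them as noise.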
Labels: full text index, localisation, localization, SQLServer2005
// posted by Jane @ 4:30 PM
Thursday, July 24, 2008
More on SQL Server 2005 Full Text Index Service
In my previous post about How to work out which are valid full text languages on a SQL Server 2005 instance I referred to sys.syslanguages and sys.fulltext_languages in my queries, but didn't really say much more about them, so here goes.
sys.syslanguages
In the definition on MSDN it states
"Contains one row for each language present in the instance of SQL Server 2005. Although U.S. English is not in syslanguages, it is always available to SQL Server."
And one thing on the choice of U.S. English vs UK English. The article SQL Server Full Text Search: Language Features says
"In actual fact UK English does not refer to the Queen's English or the English used in the United Kingdom, but International English; the English that is used in all other English speaking countries other than US English."
As an English person, living in England and speaking English I find this a somewhat grating use of the phrase UK English. Bah!
sys.fulltext_languages
In the definition on MSDN it states
"This catalog view contains one row per language available for full-text indexing/querying operations. Each row provides an unambiguous representation of the available full-text linguistic resources that are registered with Microsoft SQL Server. The name or lcid can be specified in the full-text queries and full-text index DDL."
The list in this view doesn't match the one in sys.syslanguages. These are purely the full-text-indexable languages. As I mentioned in my previous post, 6 languages can be added by following these instructions. The line
"The name or lcid can be specified in the full-text queries and full-text index DDL."
refers to the ability to issue the following SQL:
SELECT *
FROM LanguageData
WHERE CONTAINS(*, 'the', language 1045 )
which indicates that the locale used for querying should be 1045, which equates to Polish. I have some sample SQL to post in the next few days which demonstrates the difference between indexing and querying language choices.
In General
I've been doing quite a bit of work trying to understand how the SQL Server 2005 full text index works, and how the language choice impacts it. My knowledge of full text indexing as a whole to this stage hasn't been great, so I've done quite a lot of background reading. Amongst the best resources I've found are two articles, both by Hillary Cotter, which provide a really simple yet pretty comprehensive introduction to the various features of indexing and querying using the Full Text Index service.
Labels: full text index, localisation, localization, SQLServer2005
// posted by Jane @ 8:02 PM
How to work out which are valid full text languages on a SQL Server 2005 instance
Despite SQL Server 2005 supporting 33 languages (found by issuing SELECT * FROM sys.syslanguages), not all of these are available for the full text index service. To find out which ones are, run the query:
SELECT *
FROM sys.fulltext_languages
On my machine, this returns the following languages:
- British English
- Chinese (Hong Kong SAR, PRC)
- Chinese (Macau SAR)
- Chinese (Singapore)
- Simplified Chinese
- Traditional Chinese
- Dutch
- English
- French
- German
- Italian
- Japanese
- Korean
- Neutral
- Spanish
- Swedish
- Thai
An additional 6 languages are supported and available as a separate install. These are:
- Danish
- Polish
- Português (Brasil)
- Portuguese
- Russian
- Turkish
To install these, follow the instructions here.
The following languages are not supported for full text searching at all within SQL Server 2005:
- Arabic
- Bulgarian
- Croatian
- Czech
- Estonian
- Finnish
- Greek
- Hungarian
- Latvian
- Lithuanian
- Norwegian
- Romanian
- Slovak
- Slovenian
SQL Server 2008 offers more full text language support, bringing the total of available languages to 50. It would appear that Danish, Polish and Turkish remain installable additions.
Labels: full text index, localisation, localization, SQLServer2005
// posted by Jane @ 7:36 PM
Wednesday, July 23, 2008
Localisation, Javascript and extended character sets in Visual Studio 2005
I'm currently doing some work looking into producing a localised version of a Madgex job board (not dissimilar to the work I did this time last year) and am mainly looking at the SQL Server and javascript areas whilst a colleague looks at the .NET side.
Glenn gave me a tip off that when he'd been doing something similar, he'd had problems with Visual Studio 2005 not saving his javascript files as UTF.
So, within VS2005 I created a javascript file and put 2 lines into it. They were simply:
alert ('hello world');
alert('Zarys gramatyki porównawczej jezyków slowianskich');
I then linked this into a (very) basic HTML page
<html xmlns="http://www.w3.org/1999/xhtml" >
<head>
<title>i18n</title>
<script type="text/javascript" src="js/i18n.js" language="javascript"></script>
</head>
<body>
</body>
</html>
so that on page load 2 alert boxes are displayed, one saying 'hello world' and the other saying 'Zarys gramatyki porównawczej jezyków slowianskich'.
Unfortunately what is displayed instead is:
[screenshot of the alert box, not preserved]
which isn't exactly what I had in mind.
I opened the file in Notepad++ (my text editor of choice) to take a look at the file type and it is, as I'd expected, saved as ANSI, not UTF-8 or UTF-16.

I used Notepad++'s menu item Format -> Convert to UTF-8 to convert this file from ANSI into UTF-8, and then re-ran my test and all works correctly as expected. Hurrah!
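Why the encoding of the saved file matters can be shown with a few bytes. The sketch below is only an illustration of the general failure mode; exactly how the browser interpreted the ANSI file isn't recorded here. It uses the short fragment 'porów' as a stand-in for an extended character in the test string, and the variable names are made up for the example.

```javascript
// 'porów' as a single-byte ANSI code page (Windows-1250/1252) stores it:
// the 'ó' is the single byte 0xF3.
const ansiBytes = new Uint8Array([0x70, 0x6f, 0x72, 0xf3, 0x77]);

// The same text saved as UTF-8: 'ó' becomes the two bytes 0xC3 0xB3.
const utf8Bytes = new TextEncoder().encode('por\u00f3w');

const utf8Decoder = new TextDecoder('utf-8');

// Reading the ANSI bytes as UTF-8: 0xF3 is not valid on its own, so the
// decoder substitutes the replacement character U+FFFD.
const ansiReadAsUtf8 = utf8Decoder.decode(ansiBytes);

// Reading the UTF-8 bytes as UTF-8 round-trips cleanly.
const utf8ReadAsUtf8 = utf8Decoder.decode(utf8Bytes);

// Reading the UTF-8 bytes one byte per character (as a single-byte code page
// would) splits 'ó' into two unrelated characters.
const utf8ReadAsSingleByte = Array.from(utf8Bytes, (b) => String.fromCharCode(b)).join('');

console.log(ansiReadAsUtf8);       // 'por\uFFFDw'
console.log(utf8ReadAsUtf8);       // 'porów'
console.log(utf8ReadAsSingleByte); // 'porÃ³w'
```

Whichever direction the mismatch runs, the text only survives when the file is written and read with the same encoding, which is why converting the file to UTF-8 fixed the test.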
I then repeated this using VS2008 and found that this is one of the fixes over VS2005.
So, the alert now correctly displays:
[screenshot of the alert box, not preserved]
and when opened in Notepad++ the file is now, correctly, UTF-8.
Labels: javascript, localisation, localization, Visual Studio
// posted by Jane @ 3:43 PM
Tuesday, June 26, 2007
Localisation
The current project I'm working on is going to be delivered in Dutch. I figured I had 2 choices, the hard coded method or a resource based method. The hard coded method would be ok, but as the whole of the site is being presented in Dutch my fear was that it would make testing and maintenance incredibly hard (for the record I don't speak Dutch!). The resource based method might take slightly longer to implement, but not enough to prevent me from doing it.
So, I'm using the .NET ResourceManager class to manage my resource files and using resgen to take a text file of name value pairs and generate the .resources file.
My naming strategy for the text items is XXX_YYY_Key where:
XXX is the main area of the project - e.g. admin, front end, etc
YYY is the module of the area - e.g. article, menu, etc
Key is the "thing" we're providing text for - e.g. title, contact name, email address, etc
so, ADMIN_ARTICLE_Title = Title etc
I have 2 files, one in English, one in Dutch, and I use an appSetting in my web.config file to specify what my DefaultCulture should be. The site will only run in either Dutch or English so there is no need to make anything more complicated than this, and it allows development and maintenance to be carried out in English, whilst the live site is presented in Dutch.
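The naming strategy and default culture setting can be sketched in a few lines. This is not the ResourceManager API, just the same idea expressed in JavaScript. The tables, the Dutch text, the function names and the fall-back to English are all assumptions made for the illustration.

```javascript
// Hypothetical resource tables, one per culture, keyed as XXX_YYY_Key.
const RESOURCES = {
  en: { ADMIN_ARTICLE_Title: 'Title', ADMIN_ARTICLE_ContactName: 'Contact name' },
  nl: { ADMIN_ARTICLE_Title: 'Titel' }, // illustrative Dutch text
};

// Stands in for the DefaultCulture appSetting in web.config.
const DEFAULT_CULTURE = 'nl';

// Build a key from area, module and item name, e.g. ADMIN_ARTICLE_Title.
function resourceKey(area, moduleName, key) {
  return `${area.toUpperCase()}_${moduleName.toUpperCase()}_${key}`;
}

// Look the text up in the requested culture, then in English (an assumption
// of this sketch), and finally return the key itself so gaps are visible.
function getText(area, moduleName, key, culture = DEFAULT_CULTURE) {
  const name = resourceKey(area, moduleName, key);
  const table = RESOURCES[culture] || {};
  if (name in table) return table[name];
  if (name in RESOURCES.en) return RESOURCES.en[name];
  return name;
}

console.log(getText('admin', 'article', 'Title'));       // 'Titel'
console.log(getText('admin', 'article', 'Title', 'en')); // 'Title'
console.log(getText('admin', 'article', 'ContactName')); // 'Contact name'
```

Returning the key when nothing matches is a deliberate choice in the sketch: a missing translation shows up on the page as ADMIN_ARTICLE_Something rather than as a blank.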
This is all working well, but then I got to thinking about the javascript and drew a blank through my google searching. Some of my javascript files perform validation, and alert the user with appropriate messages if the content isn't valid. I asked some of my local brains and still came up with a blank; a few names were bandied about, but I didn't get to HackDay and so didn't get to pick their brains.
At the moment, I've added a line to my resources file to hold the name of the javascript file to include in the page. This allows there to be multiple validation files, one per language if I want - at the moment there is just one file and it's in Dutch. But really this isn't a great solution. I've started doing some investigation and playing, but haven't really had the time to progress it any further.
The current 4 options to do this that I can think of are:
- Basic - 2 javascript files, one per language - duplicate functionality
- Literals - 3 javascript files - one per language, plus one with functionality
- A self detecting piece of javascript that works out the locale from some setting and thus includes the appropriate literals file (or other data store method) dynamically
- A method using a better form of data store than variables
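As a rough sketch of the "Literals" option combined with a simple locale setting, the messages can live in one object per language while the validation logic is written once. Everything here is hypothetical: the LITERALS object, the function names, the message wording (including the Dutch) and the email pattern are for illustration only.

```javascript
// Literals: one object per language (in practice these could be separate files).
const LITERALS = {
  en: {
    required: 'Please enter a value for {0}.',
    email: '{0} is not a valid email address.',
  },
  nl: {
    required: 'Vul een waarde in voor {0}.',
    email: '{0} is geen geldig e-mailadres.',
  },
};

// Pick the literals for a language tag such as 'nl' or 'nl-NL',
// falling back to English when the language is unknown.
function pickLiterals(lang) {
  const base = String(lang || '').toLowerCase().split('-')[0];
  return LITERALS[base] || LITERALS.en;
}

// Replace {0}, {1}, ... in a template with the supplied arguments.
function format(template, ...args) {
  return template.replace(/\{(\d+)\}/g, (match, i) => (i in args ? args[i] : match));
}

// Functionality: written once, independent of the language of the messages.
// Returns null when the value is acceptable, otherwise a localised message.
function validateEmail(fieldLabel, value, lang) {
  const text = pickLiterals(lang);
  if (!value) return format(text.required, fieldLabel);
  if (!/^[^@\s]+@[^@\s]+\.[^@\s]+$/.test(value)) return format(text.email, value);
  return null;
}

console.log(validateEmail('E-mail', 'jane@example', 'nl-NL'));
console.log(validateEmail('Email', '', 'en'));
console.log(validateEmail('Email', 'jane@example.com', 'en'));
```

In a page, the lang argument could come from something like document.documentElement.lang, which would move this towards the self detecting option without duplicating any of the validation code.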
Anyone done any localisation/localization in javascript and got any more options or solutions for me?
Labels: ASP.NET, javascript, localisation, localization, resources
// posted by Jane @ 7:47 PM