Aquileo | Recent changes to wiki

Aquileo | Discussion for Home page

2022-07-06T19:22:20.831000Z

Hi
Is there any development?
Best regards

Aquileo | charset detection modified by YaN

2013-04-16T19:51:36.218000Z

--- v3
+++ v4
@@ -9,7 +9,7 @@
 *the sequence of bytes is checked for legal (for this encoding) patterns. If test fails encoding marked as "not me" and next step skipped.
 *if applicable (single and multi-byte encodings) the statistical analysis executed, like frequency distribution of trigraphs (three letter groups) for each language that can be written using that encoding. The result of analysis is confidence - the probability that this language and this encoding is used.

-After loop evaluation the encding with higest confidence is reported. 
+After loop evaluation the encoding with higest confidence is reported. 

 #Implementations
 ##Mozilla charset detector

Aquileo | charset detection modified by YaN

2013-04-16T19:14:12.407000Z

--- v2
+++ v3
@@ -7,8 +7,22 @@
 Assumes that we have an array of bytes as input. For each detectable encoding the following procedure will be executed:
 *we assumes that bytes are representing a text in this encoding.
 *the sequence of bytes is checked for legal (for this encoding) patterns. If test fails encoding marked as "not me" and next step skipped.
-*if applicable (single and multi-byte encodings) the statistical analysis executed, like frequency distribution of trigraphs (three letter groups) for each language that can be written using that encoding. The result of analysis is confidence - the probability that this language is used.
+*if applicable (single and multi-byte encodings) the statistical analysis executed, like frequency distribution of trigraphs (three letter groups) for each language that can be written using that encoding. The result of analysis is confidence - the probability that this language and this encoding is used.

 After loop evaluation the encding with higest confidence is reported. 

 #Implementations
+##Mozilla charset detector
+##enca
+##MS Windows API
+##ICU charset detector
+##Lazarus charset detector
+
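+Comparsion
+
+| Encoding | Mozilla | enca | Win API | ICU | Lazarus |
+|----------|---------|------|---------|-----|---------|
+| UTF-7    | +       | +    | +       | +   | +       |
+| UTF-8    | +       | +    | +       | +   | +       |
+| UTF-16LE | +       | +    | +       | +   | +       |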

Aquileo | charset detection modified by YaN

2013-04-16T18:52:54.993000Z

--- v1
+++ v2
@@ -4,15 +4,11 @@

 #Algorithm
-Several different techniques are used for character set detection.
+Assumes that we have an array of bytes as input. For each detectable encoding the following procedure will be executed:
+*we assumes that bytes are representing a text in this encoding.
+*the sequence of bytes is checked for legal (for this encoding) patterns. If test fails encoding marked as "not me" and next step skipped.
+*if applicable (single and multi-byte encodings) the statistical analysis executed, like frequency distribution of trigraphs (three letter groups) for each language that can be written using that encoding. The result of analysis is confidence - the probability that this language is used.

-For multi-byte encodings, the sequence of bytes is checked for legal patterns. The detected characters are also check against a list of frequently used characters in that encoding. 
-
-For single byte encodings, the data is checked against a list of the most commonly occurring three letter groups for each language that can be written using that encoding. 
-
-The detection process can be configured to optionally ignore html or xml style markup, which can interfere with the detection process by changing the statistics.
-
-Because of this, detection works best if you supply at least a few hundred bytes of character data that's mostly in a single language. 
-
+After loop evaluation the encding with higest confidence is reported. 

 #Implementations
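
The per-encoding loop described in this revision can be sketched roughly as follows. This is only an illustration, not the project's code: `is_legal`, `letter_ratio`, and `detect` are hypothetical names, strict decoding with Python's built-in codecs stands in for the "legal patterns" check, and `letter_ratio` is a crude placeholder for the trigraph statistics, not an implementation of them.

```python
from __future__ import annotations

from typing import Callable, Iterable, Optional


def is_legal(data: bytes, encoding: str) -> bool:
    """Check that the bytes form legal sequences in the given encoding."""
    try:
        data.decode(encoding)  # strict mode raises on illegal byte patterns
        return True
    except UnicodeDecodeError:
        return False


def letter_ratio(data: bytes, encoding: str) -> float:
    """Placeholder statistic (assumption): share of letters and whitespace."""
    text = data.decode(encoding)
    if not text:
        return 0.0
    good = sum(1 for ch in text if ch.isalpha() or ch.isspace())
    return good / len(text)


def detect(
    data: bytes,
    candidates: Iterable[str],
    analyze: Callable[[bytes, str], float],
) -> Optional[tuple[str, float]]:
    """Return (encoding, confidence) for the best candidate, or None."""
    best = None
    for encoding in candidates:
        if not is_legal(data, encoding):
            continue  # "not me": skip the statistical step for this encoding
        confidence = analyze(data, encoding)
        if best is None or confidence > best[1]:
            best = (encoding, confidence)
    return best
```

Note that a single-byte encoding such as Latin-1 accepts every byte sequence, so the legality check alone cannot rule it out; in a sketch like this, only the statistical step separates it from other candidates.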

Aquileo | Home modified by YaN

2013-04-16T18:10:31.139000Z

--- v6
+++ v7
@@ -6,4 +6,5 @@
 The last version of Charset Detector can be downloaded here [[download_button]]

+------------------------
 [[project_admins]]

Aquileo | Home modified by YaN

2013-04-16T18:10:10.780000Z

--- v5
+++ v6
@@ -3,8 +3,7 @@
 Here you can find some description about [charset detection] process.

-You can download last version of Charset Detector here
-[[download_button]]
+The last version of Charset Detector can be downloaded here [[download_button]]

 [[project_admins]]

Aquileo | charset detection modified by YaN

2013-04-16T18:08:50.705000Z

Overview
Algorithm
Implementations

Overview

Character set detection is the process of determining the character set, or encoding, of character data in an unknown format. This is, at best, an imprecise operation using statistics and heuristics. In some cases, the language can be determined along with the encoding.

Algorithm

Several different techniques are used for character set detection.

For multi-byte encodings, the sequence of bytes is checked for legal patterns. The detected characters are also checked against a list of frequently used characters in that encoding.

For single byte encodings, the data is checked against a list of the most commonly occurring three letter groups for each language that can be written using that encoding.

The detection process can be configured to optionally ignore HTML- or XML-style markup, which can interfere with detection by skewing the statistics.

Because detection is statistical, it works best if you supply at least a few hundred bytes of character data that is mostly in a single language.
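
As a rough illustration of the single-byte case, the sketch below strips markup and then measures how many letter trigrams of the decoded text appear in a language profile. It is a simplified assumption of how such a check might look: `ENGLISH_TRIGRAMS` is a tiny made-up profile (real detectors use profiles built from large corpora), and `strip_markup` and `trigram_score` are hypothetical names.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny profile of common English trigrams.
ENGLISH_TRIGRAMS = {"the", "and", "ing", "ion", "ent", "her", "for", "tha"}

TAG_RE = re.compile(r"<[^>]*>")
WORD_RE = re.compile(r"[^\W\d_]+")  # runs of letters only


def strip_markup(text):
    """Replace HTML/XML-style tags with spaces so they do not skew the counts."""
    return TAG_RE.sub(" ", text)


def trigram_score(data, encoding, profile):
    """Fraction of letter trigrams in the decoded data found in the profile."""
    text = strip_markup(data.decode(encoding, errors="replace")).lower()
    counts = Counter()
    for word in WORD_RE.findall(text):
        for i in range(len(word) - 2):
            counts[word[i:i + 3]] += 1
    total = sum(counts.values())
    if total == 0:
        return 0.0  # too little text to say anything
    hits = sum(n for tri, n in counts.items() if tri in profile)
    return hits / total
```

With only a handful of words the score swings widely from one input to the next, which is one way to see why a few hundred bytes of text are recommended.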

Implementations

Aquileo | Home modified by YaN

2013-04-16T18:01:20.810000Z

--- v4
+++ v5
@@ -1,10 +1,10 @@
 Welcome to Charset Detector wiki!
+
+Here you can find some description about [charset detection] process.
+

 You can download last version of Charset Detector here
 [[download_button]]

-Here you can find some description about [charset detection] process.
-
-
 [[project_admins]]

Aquileo | Home modified by YaN

2013-04-16T18:00:46.240000Z

--- v3
+++ v4
@@ -4,7 +4,6 @@
 [[download_button]]

-
 Here you can find some description about [charset detection] process.

Aquileo | Home modified by YaN

2013-04-16T17:59:05.487000Z

--- v2
+++ v3
@@ -3,6 +3,8 @@
 You can download last version of Charset Detector here
 [[download_button]]

+
+
 Here you can find some description about [charset detection] process.
