Aquileo | Recent changes to wikihttps://sourceforge.net/p/chsdet/wiki/2022-07-06T19:22:20.831000ZRecent changes to wikiAquileo | Discussion for Home page2022-07-06T19:22:20.831000Z2022-07-06T19:22:20.831000ZYannick LANCHEChttps://sourceforge.net/u/ylanchec/https://sourceforge.net07b7fb7b94ceade7b0b099a4e940d76819be7bbe<div class="markdown_content"><p>Hi <br/> Is there any development ?<br/> Best regards</p></div>Aquileo | charset detection modified by YaN2013-04-16T19:51:36.218000Z2013-04-16T19:51:36.218000ZYaNhttps://sourceforge.net/u/userid-1329817/https://sourceforge.net0a22015bda122c2d2549275f43e6133ccd98b3ea<div class="markdown_content"><pre>--- v3 +++ v4 @@ -9,7 +9,7 @@ *the sequence of bytes is checked for legal (for this encoding) patterns. If test fails encoding marked as "not me" and next step skipped. *if applicable (single and multi-byte encodings) the statistical analysis executed, like frequency distribution of trigraphs (three letter groups) for each language that can be written using that encoding. The result of analysis is confidence - the probability that this language and this encoding is used. -After loop evaluation the encding with higest confidence is reported. +After loop evaluation the encoding with higest confidence is reported. #Implementations ##Mozilla charset detector </pre> </div>Aquileo | charset detection modified by YaN2013-04-16T19:14:12.407000Z2013-04-16T19:14:12.407000ZYaNhttps://sourceforge.net/u/userid-1329817/https://sourceforge.net876d097d8816c3a81d2664c24e195ce9b67aca97<div class="markdown_content"><pre>--- v2 +++ v3 @@ -7,8 +7,22 @@ Assumes that we have an array of bytes as input. For each detectable encoding the following procedure will be executed: *we assumes that bytes are representing a text in this encoding. *the sequence of bytes is checked for legal (for this encoding) patterns. If test fails encoding marked as "not me" and next step skipped. 
-*if applicable (single and multi-byte encodings) the statistical analysis executed, like frequency distribution of trigraphs (three letter groups) for each language that can be written using that encoding. The result of analysis is confidence - the probability that this language is used. +*if applicable (single and multi-byte encodings) the statistical analysis executed, like frequency distribution of trigraphs (three letter groups) for each language that can be written using that encoding. The result of analysis is confidence - the probability that this language and this encoding is used. After loop evaluation the encding with higest confidence is reported. #Implementations +##Mozilla charset detector +##enca +##MS Windows API +##ICU charset detector +##Lazarus charset detector + +Comparsion +<table> +<tr><td>Encoding</td><td>Mozilla</td><td>enca</td><td>Win API</td><td>ICU</td><td>Lazarus</td></tr> +<tr><td>UTF-7</td><td>+</td><td>+</td><td>+</td><td>+</td><td>+</td></tr> +<tr><td>UTF-8</td><td>+</td><td>+</td><td>+</td><td>+</td><td>+</td></tr> +<tr><td>UTF-16LE</td><td>+</td><td>+</td><td>+</td><td>+</td><td>+</td></tr> + +</table> </pre> </div>Aquileo | charset detection modified by YaN2013-04-16T18:52:54.993000Z2013-04-16T18:52:54.993000ZYaNhttps://sourceforge.net/u/userid-1329817/https://sourceforge.net56cd02aa35cafd25a4c9a3190037d01b082d47e4<div class="markdown_content"><pre>--- v1 +++ v2 @@ -4,15 +4,11 @@ #Algorithm -Several different techniques are used for character set detection. +Assumes that we have an array of bytes as input. For each detectable encoding the following procedure will be executed: +*we assumes that bytes are representing a text in this encoding. +*the sequence of bytes is checked for legal (for this encoding) patterns. If test fails encoding marked as "not me" and next step skipped. 
+*if applicable (single and multi-byte encodings) the statistical analysis executed, like frequency distribution of trigraphs (three letter groups) for each language that can be written using that encoding. The result of analysis is confidence - the probability that this language is used. -For multi-byte encodings, the sequence of bytes is checked for legal patterns. The detected characters are also check against a list of frequently used characters in that encoding. - -For single byte encodings, the data is checked against a list of the most commonly occurring three letter groups for each language that can be written using that encoding. - -The detection process can be configured to optionally ignore html or xml style markup, which can interfere with the detection process by changing the statistics. - -Because of this, detection works best if you supply at least a few hundred bytes of character data that's mostly in a single language. - +After loop evaluation the encding with higest confidence is reported. #Implementations </pre> </div>Aquileo | Home modified by YaN2013-04-16T18:10:31.139000Z2013-04-16T18:10:31.139000ZYaNhttps://sourceforge.net/u/userid-1329817/https://sourceforge.netf7b60c031950057796001884bbd701cb4e9277f5<div class="markdown_content"><pre>--- v6 +++ v7 @@ -6,4 +6,5 @@ The last version of Charset Detector can be downloaded here [[download_button]] +------------------------ [[project_admins]] </pre> </div>Aquileo | Home modified by YaN2013-04-16T18:10:10.780000Z2013-04-16T18:10:10.780000ZYaNhttps://sourceforge.net/u/userid-1329817/https://sourceforge.netc6142c36e40c69411dfeea502474aa96ff020b7c<div class="markdown_content"><pre>--- v5 +++ v6 @@ -3,8 +3,7 @@ Here you can find some description about [charset detection] process. 
-You can download last version of Charset Detector here -[[download_button]] +The last version of Charset Detector can be downloaded here [[download_button]] [[project_admins]] </pre> </div>Aquileo | charset detection modified by YaN2013-04-16T18:08:50.705000Z2013-04-16T18:08:50.705000ZYaNhttps://sourceforge.net/u/userid-1329817/https://sourceforge.net9645486beab14081a9ee7f6e5075a42262323917<div class="markdown_content"><div class="toc"> <ul> <li><a href="#overview">Overview</a></li> <li><a href="#algorithm">Algorithm</a></li> <li><a href="#implementations">Implementations</a></li> </ul> </div> <h1 id="overview">Overview</h1> <p>Character set detection is the process of determining the character set, or encoding, of character data in an unknown format. This is, at best, an imprecise operation using statistics and heuristics. In some cases, the language can be determined along with the encoding.</p> <h1 id="algorithm">Algorithm</h1> <p>Several different techniques are used for character set detection.</p> <p>For multi-byte encodings, the sequence of bytes is checked for legal patterns. The detected characters are also checked against a list of frequently used characters in that encoding. </p> <p>For single-byte encodings, the data is checked against a list of the most commonly occurring three-letter groups for each language that can be written using that encoding. </p> <p>The detection process can be configured to optionally ignore HTML- or XML-style markup, which can interfere with the detection process by changing the statistics.</p> <p>Because of this, detection works best if you supply at least a few hundred bytes of character data that's mostly in a single language. 
</p> <h1 id="implementations">Implementations</h1></div>Aquileo | Home modified by YaN2013-04-16T18:01:20.810000Z2013-04-16T18:01:20.810000ZYaNhttps://sourceforge.net/u/userid-1329817/https://sourceforge.net8ea6d47050401d574c8bd89161262d2f18a2f587<div class="markdown_content"><pre>--- v4 +++ v5 @@ -1,10 +1,10 @@ Welcome to Charset Detector wiki! + +Here you can find some description about [charset detection] process. + You can download last version of Charset Detector here [[download_button]] -Here you can find some description about [charset detection] process. - - [[project_admins]] </pre> </div>Aquileo | Home modified by YaN2013-04-16T18:00:46.240000Z2013-04-16T18:00:46.240000ZYaNhttps://sourceforge.net/u/userid-1329817/https://sourceforge.net2b83017511ad04116a6ebdcfc6f210a18b894efd<div class="markdown_content"><pre>--- v3 +++ v4 @@ -4,7 +4,6 @@ [[download_button]] - Here you can find some description about [charset detection] process. </pre> </div>Aquileo | Home modified by YaN2013-04-16T17:59:05.487000Z2013-04-16T17:59:05.487000ZYaNhttps://sourceforge.net/u/userid-1329817/https://sourceforge.netc2957418a4839d330cf763e8ebd94fbf66a17e17<div class="markdown_content"><pre>--- v2 +++ v3 @@ -3,6 +3,8 @@ You can download last version of Charset Detector here [[download_button]] + + Here you can find some description about [charset detection] process. </pre> </div>
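The detection loop described in the charset detection page revisions above (legality check, then statistical analysis, then picking the highest confidence) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code of Charset Detector or of any implementation listed on the page: the `PROFILES` table, the function names, and the tiny trigraph sets are all hypothetical, and the "confidence" here is a crude hit ratio rather than a real probability.

```python
# Hypothetical trigraph profiles: (encoding, language) -> common three-letter groups.
# Real detectors use much larger, statistically derived tables.
PROFILES = {
    ("utf-8", "en"): {"the", "and", "ing", " th", "he ", "nd "},
    ("cp1251", "ru"): {" пр", "ени", "ого", "ост", " на", "про"},
}


def is_legal(data: bytes, encoding: str) -> bool:
    """Check that the bytes form a legal sequence in this encoding."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


def confidence(text: str, trigraphs: set) -> float:
    """Share of the text's trigraphs that appear in the language profile."""
    grams = [text[i:i + 3] for i in range(len(text) - 2)]
    if not grams:
        return 0.0
    return sum(1 for g in grams if g in trigraphs) / len(grams)


def detect(data: bytes):
    """Return (encoding, language, confidence) for the best candidate."""
    best = (None, None, 0.0)
    for (encoding, language), trigraphs in PROFILES.items():
        # Assume the bytes are text in this encoding, then test that assumption.
        if not is_legal(data, encoding):
            continue  # "not me": skip the statistical step
        score = confidence(data.decode(encoding).lower(), trigraphs)
        if score > best[2]:
            best = (encoding, language, score)
    return best


if __name__ == "__main__":
    print(detect("the cat and the dog are singing in the rain".encode("utf-8")))
    print(detect("программа проверяет последовательность байтов".encode("cp1251")))
```

With so little input and such small profiles the scores are not meaningful in themselves. The sketch only shows why a candidate that fails the legality check never reaches the statistical step, and why longer single-language input gives the statistics more to work with.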