
# Supporting new or updating languages #
We generate statistical language data using Wikipedia as a natural
language text resource.
Right now, we only have automated scripts to generate statistical data
for single-byte encodings. Multi-byte encodings usually require more
in-depth knowledge of their specifications.
## New single-byte encoding ##
Uchardet relies on language data, so rather than supporting a charset
on its own, it in fact supports a (language, charset) pair. For
instance, if uchardet supports (French, ISO-8859-15), it should be able
to recognize French text encoded in ISO-8859-15, but may fail to detect
ISO-8859-15 for unsupported languages.
Though less flexible, this approach makes uchardet much more accurate
than other detection systems, and also makes it an efficient language
recognition system.
Since many single-byte charsets actually share the same layout (or very
similar ones), it is impossible to build an accurate single-byte
encoding detector for arbitrary text.
Therefore you need to describe both the language and the codepoint
layout of every charset you want to add support for.
I recommend having a look at langs/fr.py, which is heavily commented,
as a base for a new language description, and at charsets/windows-1252.py
as a base for a new charset layout (note that charset layouts can be
shared between languages; if yours is already there, you have nothing
to do).
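As a rough illustration only, a language description is a small Python module defining a handful of attributes. The attribute names below (`name`, `charsets`, `alphabet`) are assumptions made for this sketch; langs/fr.py is the authoritative, fully commented reference for the real attribute set.

```python
# Hypothetical sketch of a language description module.
# See langs/fr.py for the real, fully commented attribute set;
# the attribute names used here are assumptions.

# Human-readable language name.
name = 'French'

# Charsets a model should be built for; each one needs a matching
# layout file under charsets/.
charsets = ['ISO-8859-1', 'ISO-8859-15', 'WINDOWS-1252']

# Letters considered part of the language, used when gathering
# character statistics from the Wikipedia text.
alphabet = 'abcdefghijklmnopqrstuvwxyzàâæçéèêëîïôœùûüÿ'
```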
The important names in the charset file are:
- `name`: an iconv-compatible charset name.
- `charmap`: fill it with CTR (control character), SYM (symbol), NUM
(number), LET (letter) or ILL (illegal codepoint) for each codepoint.
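To make the two names concrete, here is a hedged sketch of what a charset layout file might look like, using the category markers listed above. The numeric marker values and the crude handling of the upper half are assumptions for this sketch; charsets/windows-1252.py is the real reference.

```python
# Hypothetical sketch of a charset layout (cf. charsets/windows-1252.py).
# The numeric values assigned to the markers are assumptions.
CTR = 0  # control character
SYM = 1  # symbol
NUM = 2  # number
LET = 3  # letter
ILL = 4  # illegal codepoint

# An iconv-compatible charset name.
name = 'WINDOWS-1252'

# One category per codepoint, 0x00..0xFF. The ASCII half follows the
# usual layout; the 0x80..0xFF half is filled with ILL here as a mere
# placeholder — a real layout file enumerates every upper codepoint.
charmap = (
    [CTR] * 32     # 0x00-0x1F: control characters
    + [SYM] * 16   # 0x20-0x2F: space and punctuation
    + [NUM] * 10   # 0x30-0x39: digits 0-9
    + [SYM] * 7    # 0x3A-0x40: punctuation
    + [LET] * 26   # 0x41-0x5A: A-Z
    + [SYM] * 6    # 0x5B-0x60: punctuation
    + [LET] * 26   # 0x61-0x7A: a-z
    + [SYM] * 4    # 0x7B-0x7E: punctuation
    + [CTR] * 1    # 0x7F: DEL
    + [ILL] * 128  # 0x80-0xFF: placeholder only (see note above)
)
```

A single-byte charset covers exactly 256 codepoints, so `charmap` must always have 256 entries.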
## Tools ##
You must install Python 3 and the [`Wikipedia` Python
tool](https://github.com/goldsmith/Wikipedia) (typically installable
with `pip install wikipedia`).
## Running the script ##
Let's say you added (or modified) support for French (`fr`); run:
> ./BuildLangModel.py fr --max-page=100 --max-depth=4
These options can be set to any value. Bigger values mean the script
will process more data, which takes more processing time now, but may
make uchardet more accurate in the end.
## Updating core code ##
If you were only updating data for an existing language model, there is
nothing else to do: just build `uchardet` again and test it.
If you were creating new models though, you will have to register them
in src/nsSBCSGroupProber.cpp and src/nsSBCharSetProber.h, and increase
the value of `NUM_OF_SBCS_PROBERS` in src/nsSBCSGroupProber.h.
Finally, add the new file to src/CMakeLists.txt.
I will be looking to make this step more straightforward in the future.