Name
    Encode::ZapCP1252 - Zap Windows Western Gremlins

Synopsis
      use Encode::ZapCP1252;

      zap_cp1252 $latin1_text;
      fix_cp1252 $utf8_text;

Description
    Have you ever been processing a Web form submit, assuming that the
    incoming text was encoded in ISO-8859-1 (Latin-1), only to end up with a
    bunch of junk because someone pasted in content from Microsoft Word?
    Well, this is because Microsoft uses a superset of the Latin-1 encoding
    called "Windows Western" or "CP1252". So mostly things will come out
    right, but a few things--like curly quotes, em dashes, ellipses, and the
    like--will not. The differences are well-known; you can see a nice chart
    documenting the differences on Wikipedia
    <http://en.wikipedia.org/wiki/Windows-1252>.

    Of course, that won't really help you. What will help you is to quit
    using Latin-1 and switch to UTF-8. Then you can just convert from CP1252
    to UTF-8 without losing a thing, just like this:

      use Encode;
      $text = decode 'cp1252', $text, 1;
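
    As a slightly fuller sketch of that approach, decoding CP1252 input and
    then encoding it as UTF-8 for output might look something like the
    following. It uses only the core Encode module; the sample bytes and
    variable names are made up for illustration.

      ```perl
      use strict;
      use warnings;
      use Encode qw(decode encode);

      # Hypothetical input: CP1252 bytes for a curly-quoted "Hi" followed
      # by an ellipsis.
      my $bytes = "\x93Hi\x94\x85";

      # Decode to Perl characters. FB_CROAK dies on malformed input, and
      # LEAVE_SRC keeps decode() from modifying $bytes.
      my $chars = decode('cp1252', $bytes, Encode::FB_CROAK | Encode::LEAVE_SRC);

      # Encode the characters as UTF-8 bytes for storage or output.
      my $utf8 = encode('UTF-8', $chars);
      ```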

    But I know that there are those of you out there stuck with Latin-1 and
    who don't want any junk characters from Word users, and that's where this
    module comes in. Its "zap_cp1252" function will zap those CP1252
    gremlins for you, turning them into their appropriate ASCII
    approximations.

    Another case that can occasionally come up is when you *are* using
    UTF-8, and you're reading in text that *claims* to be UTF-8, but it
    *still* ends up with some CP1252 gremlins mixed in with true UTF-8
    characters. I've seen examples of just this sort of thing when
    processing GMail messages and attempting to insert them into a UTF-8
    database. Doesn't work so well. So this module also offers "fix_cp1252",
    which converts those CP1252 gremlins into their UTF-8 equivalents.

Usage
    This module exports two subroutines: "zap_cp1252()" and "fix_cp1252()".
    You use these subroutines like so:

      zap_cp1252 $text;
      fix_cp1252 $text;

    The "zap_cp1252()" subroutine performs *in place* conversions of any
    CP1252 gremlins into their appropriate ASCII approximations, while
    "fix_cp1252()" converts them, in place, into their UTF-8 equivalents.

    Note that because the conversion happens in place, the data to be
    converted *cannot* be a string constant; it must be a scalar variable.
    For convenience, the converted string is also returned when the
    subroutines are called in a non-void context:

      my $fixed = zap_cp1252 $text;
      # $text and $fixed are the same.
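
    To make the in-place behavior concrete, here is a minimal sketch that
    zaps a made-up byte string in void context. The sample text is an
    assumption for illustration; the replacements it gets are the ones
    listed in the conversion table below.

      ```perl
      use strict;
      use warnings;
      use Encode::ZapCP1252;

      # Hypothetical form input: CP1252 curly quotes, an en dash, and an
      # ellipsis, all as raw bytes.
      my $text = "\x93Hello\x94 \x96 wait\x85";

      # Void context: $text itself is rewritten with ASCII approximations.
      zap_cp1252 $text;

      print "$text\n";
      ```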

    In Perl 5.8 and higher, the conversion will work whether the string is
    decoded to Perl's internal form (usually via "decode 'ISO-8859-1',
    $text") or the string is encoded (and thus simply processed by Perl as a
    series of bytes). The conversion will even work on a string that has not
    been decoded but has had its "utf8" flag flipped anyway (usually by an
    injudicious use of "Encode::_utf8_on()"). This is to enable the highest
    possible likelihood of removing those CP1252 gremlins no matter what
    kind of processing has already been executed on the string.
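
    As an illustration of the first two cases, the sketch below zaps the
    same made-up text once as raw bytes and once after decoding it as
    ISO-8859-1. Going by the description above, both variables should end
    up holding the same ASCII text; the variable names are hypothetical.

      ```perl
      use strict;
      use warnings;
      use Encode qw(decode);
      use Encode::ZapCP1252;

      # The same CP1252 single quotes, once as undecoded octets...
      my $bytes = "\x91quoted\x92";

      # ...and once decoded to Perl's internal form as if it were Latin-1.
      my $chars = decode('ISO-8859-1', "\x91quoted\x92");

      # Zap each one in place (void context).
      zap_cp1252 $bytes;
      zap_cp1252 $chars;
      ```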

Conversion Table
    Here's how the characters are converted to ASCII and UTF-8. The ASCII
    conversions are not perfect, but they should be good enough for general
    cleanup. If you want perfect, switch to UTF-8 and be done with it!

       Hex | Char  | ASCII | UTF-8 Name
      -----+-------+-------+-------------------------------------------
      0x80 |   €   |   e   | EURO SIGN
      0x82 |   ‚   |   ,   | SINGLE LOW-9 QUOTATION MARK
      0x83 |   ƒ   |   f   | LATIN SMALL LETTER F WITH HOOK
      0x84 |   „   |   ,,  | DOUBLE LOW-9 QUOTATION MARK
      0x85 |   …   |  ...  | HORIZONTAL ELLIPSIS
      0x86 |   †   |   +   | DAGGER
      0x87 |   ‡   |   ++  | DOUBLE DAGGER
      0x88 |   ˆ   |   ^   | MODIFIER LETTER CIRCUMFLEX ACCENT
      0x89 |   ‰   |   %   | PER MILLE SIGN
      0x8a |   Š   |   S   | LATIN CAPITAL LETTER S WITH CARON
      0x8b |   ‹   |   <   | SINGLE LEFT-POINTING ANGLE QUOTATION MARK
      0x8c |   Œ   |   OE  | LATIN CAPITAL LIGATURE OE
      0x8e |   Ž   |   Z   | LATIN CAPITAL LETTER Z WITH CARON
      0x91 |   ‘   |   '   | LEFT SINGLE QUOTATION MARK
      0x92 |   ’   |   '   | RIGHT SINGLE QUOTATION MARK
      0x93 |   “   |   "   | LEFT DOUBLE QUOTATION MARK
      0x94 |   ”   |   "   | RIGHT DOUBLE QUOTATION MARK
      0x95 |   •   |   *   | BULLET
      0x96 |   –   |   -   | EN DASH
      0x97 |   —   |   --  | EM DASH
      0x98 |   ˜   |   ~   | SMALL TILDE
      0x99 |   ™   |  (tm) | TRADE MARK SIGN
      0x9a |   š   |   s   | LATIN SMALL LETTER S WITH CARON
      0x9b |   ›   |   >   | SINGLE RIGHT-POINTING ANGLE QUOTATION MARK
      0x9c |   œ   |   oe  | LATIN SMALL LIGATURE OE
      0x9e |   ž   |   z   | LATIN SMALL LETTER Z WITH CARON
      0x9f |   Ÿ   |   Y   | LATIN CAPITAL LETTER Y WITH DIAERESIS

  Changing the Tables
    Don't like these conversions? You can modify them to your heart's content
    by accessing this module's internal conversion tables. For example, if
    you wanted "zap_cp1252()" to use an uppercase "E" for the euro sign,
    just do this:

      local $Encode::ZapCP1252::ascii_for{"\x80"} = 'E';

    Or if, for some bizarre reason, you wanted the UTF-8 equivalent for a
    bullet converted by "fix_cp1252()" to really be an asterisk (why would
    you? Just use "zap_cp1252" for that!), you can do this:

      local $Encode::ZapCP1252::utf8_for{"\x95"} = '*';

    Just remember, without "local" this would be a global change. In that
    case, be careful if your code zaps CP1252 elsewhere. Of course, it
    shouldn't really be doing that. These functions are just for cleaning up
    messes in one spot in your code, not for making a fundamental part of
    your text handling. For that, use Encode.
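
    One way to keep such an override contained is to wrap it in a bare
    block, so that "local" restores the default as soon as the block exits.
    This is only a sketch; the sample strings and variable names are
    invented for the example.

      ```perl
      use strict;
      use warnings;
      use Encode::ZapCP1252;

      # Two hypothetical byte strings that start with the CP1252 euro sign.
      my $price = "\x80 100";
      my $other = "\x80 200";

      {
          # Override the euro mapping for the duration of this block only.
          local $Encode::ZapCP1252::ascii_for{"\x80"} = 'E';
          zap_cp1252 $price;
      }

      # Outside the block, the default lowercase "e" is back in effect.
      zap_cp1252 $other;
      ```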

See Also
    Encode
    Wikipedia: Windows-1252 <http://en.wikipedia.org/wiki/Windows-1252>

Support
    This module is stored in an open GitHub repository
    <http://github.com/theory/encode-cp1252/tree/>. Feel free to fork and
    contribute!

    Please file bug reports via GitHub Issues
    <http://github.com/theory/encode-cp1252/issues/> or by sending mail to
    bug-Encode-CP1252@rt.cpan.org <mailto:bug-Encode-CP1252@rt.cpan.org>.

Author
    David Wheeler <david@kineticode.com>

Acknowledgements
    My thanks to Sean Burke for sending me his original method for
    converting CP1252 gremlins to more-or-less appropriate ASCII characters.

Copyright and License
    Copyright (c) 2005-2010 Kineticode, Inc. Some Rights Reserved.

    This module is free software; you can redistribute it and/or modify it
    under the same terms as Perl itself.