    pcre

    
    
    

    INTRODUCTION

    
           The  PCRE  library is a set of functions that implement regular expres-
           sion pattern matching using the same syntax and semantics as Perl, with
           just  a  few  differences. Certain features that appeared in Python and
           PCRE before they appeared in Perl are also available using  the  Python
           syntax.  There is also some support for certain .NET and Oniguruma syn-
           tax items, and there is an option for  requesting  some  minor  changes
           that give better JavaScript compatibility.
    
           The  current  implementation of PCRE (release 7.x) corresponds approxi-
           mately with Perl 5.10, including support for UTF-8 encoded strings  and
           Unicode general category properties. However, UTF-8 and Unicode support
           has to be explicitly enabled; it is not the default. The Unicode tables
           correspond to Unicode release 5.0.0.
    
           In  addition to the Perl-compatible matching function, PCRE contains an
           alternative matching function that matches the same  compiled  patterns
           in  a different way. In certain circumstances, the alternative function
           has some advantages. For a discussion of the two  matching  algorithms,
           see the pcrematching page.
    
           PCRE  is  written  in C and released as a C library. A number of people
           have written wrappers and interfaces of various kinds.  In  particular,
           Google  Inc.   have  provided  a comprehensive C++ wrapper. This is now
           included as part of the PCRE distribution. The pcrecpp page has details
           of  this  interface.  Other  people's contributions can be found in the
           Contrib directory at the primary FTP site, which is:
    
           ftp://ftp.csx.cam.ac.uk/pub/software/programming/pcre
    
           Details of exactly which Perl regular expression features are  and  are
           not supported by PCRE are given in separate documents. See the pcrepat-
           tern and pcrecompat pages. There is a syntax summary in the  pcresyntax
           page.
    
           Some  features  of  PCRE can be included, excluded, or changed when the
           library is built. The pcre_config() function makes it  possible  for  a
           client  to  discover  which  features are available. The features them-
           selves are described in the pcrebuild page. Documentation about  build-
           ing  PCRE for various operating systems can be found in the README file
           in the source distribution.
    
           The library contains a number of undocumented  internal  functions  and
           data  tables  that  are  used by more than one of the exported external
           functions, but which are not intended  for  use  by  external  callers.
           Their  names  all begin with "_pcre_", which hopefully will not provoke
           any name clashes. In some environments, it is possible to control which
           external  symbols  are  exported when a shared library is built, and in
           these cases the undocumented symbols are not exported.
    
    
    

    USER DOCUMENTATION

             pcregrep          description of the pcregrep command
             pcrematching      discussion of the two matching algorithms
             pcrepartial       details of the partial matching facility
             pcrepattern       syntax and semantics of supported
                                 regular expressions
             pcresyntax        quick syntax reference
             pcreperform       discussion of performance issues
             pcreposix         the POSIX-compatible C API
             pcreprecompile    details of saving and re-using precompiled patterns
             pcresample        discussion of the sample program
             pcrestack         discussion of stack usage
             pcretest          description of the pcretest testing command
    
           In  addition,  in the "man" and HTML formats, there is a short page for
           each C library function, listing its arguments and results.
    
    
    

    LIMITATIONS

    
           There are some size limitations in PCRE but it is hoped that they  will
           never in practice be relevant.
    
           The  maximum  length of a compiled pattern is 65539 (sic) bytes if PCRE
           is compiled with the default internal linkage size of 2. If you want to
           process  regular  expressions  that are truly enormous, you can compile
           PCRE with an internal linkage size of 3 or 4 (see the  README  file  in
           the  source  distribution and the pcrebuild documentation for details).
           In these cases the limit is substantially larger.  However,  the  speed
           of execution is slower.
    
           All values in repeating quantifiers must be less than 65536.
    
           There is no limit to the number of parenthesized subpatterns, but there
           can be no more than 65535 capturing subpatterns.
    
           The maximum length of a name for a named subpattern is 32 characters, and
           the maximum number of named subpatterns is 10000.
    
           The  maximum  length of a subject string is the largest positive number
           that an integer variable can hold. However, when using the  traditional
           matching function, PCRE uses recursion to handle subpatterns and indef-
           inite repetition.  This means that the available stack space may  limit
           the size of a subject string that can be processed by certain patterns.
           For a discussion of stack issues, see the pcrestack documentation.
    
    
    

    UTF-8 AND UNICODE PROPERTY SUPPORT

    
           From release 3.3, PCRE has  had  some  support  for  character  strings
           encoded  in the UTF-8 format. For release 4.0 this was greatly extended
           to cover most common requirements, and in release 5.0  additional  sup-
           port for Unicode general category properties was added.
    
           In order to process UTF-8 strings, you must build PCRE to include UTF-8
           for a decimal number, the Unicode script names such as Arabic  or  Han,
           and  the  derived  properties  Any  and L&. A full list is given in the
           pcrepattern documentation. Only the short names for properties are sup-
           ported.  For example, \p{L} matches a letter. Its Perl synonym, \p{Let-
           ter}, is not supported.  Furthermore,  in  Perl,  many  properties  may
           optionally  be  prefixed by "Is", for compatibility with Perl 5.6. PCRE
           does not support this.
    
       Validity of UTF-8 strings
    
           When you set the PCRE_UTF8 flag, the strings  passed  as  patterns  and
           subjects are (by default) checked for validity on entry to the relevant
           functions. From release 7.3 of PCRE, the check is according to the rules
           of  RFC  3629, which are themselves derived from the Unicode specifica-
           tion. Earlier releases of PCRE followed the rules of  RFC  2279,  which
           allows  the  full range of 31-bit values (0 to 0x7FFFFFFF). The current
           check allows only values in the range U+0 to U+10FFFF, excluding U+D800
           to U+DFFF.
    
           The  excluded  code  points are the "Low Surrogate Area" of Unicode, of
           which the Unicode Standard says this: "The Low Surrogate Area does  not
           contain  any  character  assignments,  consequently  no  character code
           charts or namelists are provided for this area. Surrogates are reserved
           for  use  with  UTF-16 and then must be used in pairs." The code points
           that are encoded by UTF-16 pairs  are  available  as  independent  code
           points  in  the  UTF-8  encoding.  (In other words, the whole surrogate
           thing is a fudge for UTF-16 which unfortunately messes up UTF-8.)
    
           If an invalid UTF-8 string is passed to PCRE, an error return is given.
           At  compile  time, the only additional information is the offset to the
           first byte of the failing character. The runtime functions (pcre_exec()
           and pcre_dfa_exec()) pass back this information as well as a more
           detailed reason code if the caller has provided memory in which  to  do
           this.
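
           The validity rules above can be sketched in C. This is only an
           illustration of the RFC 3629 check (values U+0 to U+10FFFF, no
           surrogates, no overlong forms), not PCRE's own code. The function
           name utf8_first_invalid() and its return convention (-1 for a
           valid string, otherwise the offset of the first byte of the
           failing character) are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of the PCRE API.
   Returns -1 if s[0..len-1] is valid UTF-8 under RFC 3629, otherwise the
   offset of the first byte of the first failing character. */
long utf8_first_invalid(const unsigned char *s, size_t len)
{
    size_t i = 0, k;
    assert(s != NULL || len == 0);

    while (i < len) {
        unsigned char c = s[i];
        size_t extra;            /* continuation bytes expected */
        unsigned long cp, min;   /* decoded value, smallest non-overlong value */

        if (c < 0x80) { i++; continue; }                      /* ASCII */
        else if ((c & 0xE0) == 0xC0) { extra = 1; cp = c & 0x1F; min = 0x80; }
        else if ((c & 0xF0) == 0xE0) { extra = 2; cp = c & 0x0F; min = 0x800; }
        else if ((c & 0xF8) == 0xF0) { extra = 3; cp = c & 0x07; min = 0x10000; }
        else return (long)i;     /* stray continuation or 5/6-byte lead byte */

        if (len - i <= extra) return (long)i;                 /* truncated */

        for (k = 1; k <= extra; k++) {
            if ((s[i + k] & 0xC0) != 0x80) return (long)i;    /* not 10xxxxxx */
            cp = (cp << 6) | (unsigned long)(s[i + k] & 0x3F);
        }

        if (cp < min) return (long)i;                         /* overlong form */
        if (cp > 0x10FFFF) return (long)i;                    /* above U+10FFFF */
        if (cp >= 0xD800 && cp <= 0xDFFF) return (long)i;     /* surrogate */

        i += extra + 1;
    }
    return -1;
}
```

           A caller that has already run a check of this kind on its data is
           in the situation described next, where repeating the check on
           every call is wasted work.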
    
           In  some  situations, you may already know that your strings are valid,
           and therefore want to skip these checks in  order  to  improve  perfor-
           mance. If you set the PCRE_NO_UTF8_CHECK flag at compile time or at run
           time, PCRE assumes that the pattern or subject  it  is  given  (respec-
           tively)  contains  only  valid  UTF-8  codes. In this case, it does not
           diagnose an invalid UTF-8 string.
    
           If you pass an invalid UTF-8 string  when  PCRE_NO_UTF8_CHECK  is  set,
           what  happens  depends on why the string is invalid. If the string con-
           forms to the "old" definition of UTF-8 (RFC 2279), it is processed as a
           string  of  characters  in  the  range 0 to 0x7FFFFFFF. In other words,
           apart from the initial validity test, PCRE (when in UTF-8 mode) handles
           strings  according  to  the more liberal rules of RFC 2279. However, if
           the string does not even conform to RFC 2279, the result is  undefined.
           Your program may crash.
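
           To make the "old" definition concrete, the following sketch
           decodes one character using the more liberal RFC 2279 layout, in
           which a sequence may be up to six bytes long and may carry any
           value from 0 to 0x7FFFFFFF. It is an illustration only and does
           not reproduce PCRE's internal decoder; the name rfc2279_value()
           and the -1 return for a malformed sequence are assumptions made
           for this example. No range, surrogate, or overlong checks are
           applied.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of the PCRE API.
   Decodes the first character of s[0..len-1] using the RFC 2279 byte
   layout (1 to 6 bytes). Returns the value (0 to 0x7FFFFFFF), or -1 if the
   bytes do not form a complete RFC 2279 sequence. */
long rfc2279_value(const unsigned char *s, size_t len)
{
    unsigned char c;
    size_t extra, k;
    unsigned long cp;

    assert(s != NULL || len == 0);
    if (len == 0) return -1;

    c = s[0];
    if (c < 0x80)                { extra = 0; cp = c; }
    else if ((c & 0xE0) == 0xC0) { extra = 1; cp = c & 0x1F; }
    else if ((c & 0xF0) == 0xE0) { extra = 2; cp = c & 0x0F; }
    else if ((c & 0xF8) == 0xF0) { extra = 3; cp = c & 0x07; }
    else if ((c & 0xFC) == 0xF8) { extra = 4; cp = c & 0x03; }
    else if ((c & 0xFE) == 0xFC) { extra = 5; cp = c & 0x01; }
    else return -1;              /* continuation byte, 0xFE, or 0xFF as lead */

    if (len <= extra) return -1; /* sequence is cut short */

    for (k = 1; k <= extra; k++) {
        if ((s[k] & 0xC0) != 0x80) return -1;   /* not a continuation byte */
        cp = (cp << 6) | (unsigned long)(s[k] & 0x3F);
    }
    return (long)cp;             /* at most 31 bits, so it fits in a long */
}
```

           The sequence F4 90 80 80, for instance, decodes to 0x110000 here,
           a value that the RFC 3629 check would reject.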
    
           If  you  want  to  process  strings  of  values  in the full range 0 to
    
           4.  The dot metacharacter matches one UTF-8 character instead of a sin-
           gle byte.
    
           5. The escape sequence \C can be used to match a single byte  in  UTF-8
           mode,  but  its  use can lead to some strange effects. This facility is
           not available in the alternative matching function, pcre_dfa_exec().
    
           6. The character escapes \b, \B, \d, \D, \s, \S, \w, and  \W  correctly
           test  characters of any code value, but the characters that PCRE recog-
           nizes as digits, spaces, or word characters  remain  the  same  set  as
           before, all with values less than 256. This remains true even when PCRE
           includes Unicode property support, because to do otherwise  would  slow
           down  PCRE in many common cases. If you really want to test for a wider
           sense of, say, "digit", you must use Unicode  property  tests  such  as
           \p{Nd}.
    
           7.  Similarly,  characters that match the POSIX named character classes
           are all low-valued characters.
    
           8. However, the Perl 5.10 horizontal and vertical white space  matching
           escapes (\h, \H, \v, and \V) do match all the appropriate Unicode char-
           acters.
    
           9. Case-insensitive matching applies only to  characters  whose  values
           are  less than 128, unless PCRE is built with Unicode property support.
           Even when Unicode property support is available, PCRE  still  uses  its
           own  character  tables when checking the case of low-valued characters,
           so as not to degrade performance.  The Unicode property information  is
           used only for characters with higher values. Even when Unicode property
           support is available, PCRE supports case-insensitive matching only when
           there  is  a  one-to-one  mapping between a letter's cases. There are a
           small number of many-to-one mappings in Unicode;  these  are  not  sup-
           ported by PCRE.
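
           The difference between bytes and characters that underlies the
           points above (for example, the dot matching one UTF-8 character
           rather than one byte) can be seen with a small counting routine.
           This is a sketch only: utf8_char_count() is a hypothetical helper,
           not a PCRE function, and it assumes the input is already valid
           UTF-8.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of the PCRE API.
   Counts UTF-8 characters by counting every byte that is not a
   continuation byte (10xxxxxx). Assumes the input is valid UTF-8. */
size_t utf8_char_count(const unsigned char *s, size_t len)
{
    size_t i, count = 0;
    assert(s != NULL || len == 0);

    for (i = 0; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) count++;   /* lead byte or ASCII byte */
    }
    return count;
}
```

           For the five-byte string "caf" followed by C3 A9 (e with an acute
           accent), the function returns 4, the number of characters a
           pattern such as "...." would need to consume in UTF-8 mode.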
    
    
    

    AUTHOR

    
           Philip Hazel
           University Computing Service
           Cambridge CB2 3QH, England.
    
           Putting  an actual email address here seems to have been a spam magnet,
           so I've taken it away. If you want to email me, use  my  two  initials,
           followed by the two digits 10, at the domain cam.ac.uk.
    
    
    

    REVISION

    
           Last updated: 12 April 2008
           Copyright (c) 1997-2011 University of Cambridge.
    
    