</head> <body> <h1>Expressões Regulares (RegEx) — Referência Rápida</h1> <h2 id="toc">Sumário</h2> <ul> <li><a href="#fundamentals">Fundamentos</a></li> <li><a href="#Options">Opções (há distinção entre maiúsculas e minúsculas)</a></li> <li><a href="#Common">Símbolos Comuns e Sintaxe</a></li> </ul> <h2 id="fundamentals">Fundamentos</h2> <p><strong>Encontrar em qualquer posição</strong>: Por padrão, uma expressão regular corresponde a uma substring <em>em qualquer posição</em> dentro da string a ser pesquisada. Por exemplo, a expressão regular <span class="regex">abc</span> corresponde a <span class="subj">abc</span>123, 123<span class="subj">abc</span>, e 123<span class="subj">abc</span>xyz. Para exigir que a correspondência ocorra somente no início ou no final da string, use uma <a href="#anchor">âncora</a>.</p> <p><strong>Caracteres escapados</strong>: A maioria dos caracteres como abc123 pode ser usada literalmente dentro de uma expressão regular. Entretanto, os caracteres <strong>\.*?+[{|()^$</strong> devem ser precedidos de uma contrabarra para que sejam tratados como caracteres literais. Por exempleonasmo, <span class="regex">\.</span> é um ponto literal e <span class="regex">\\</span> é uma contrabarra literal. O escape de caracteres pode ser dispensado se se usar \Q...\E. Por exemplo: <span class="regex">\QTexto Literal\E</span>.</p> <p><strong>Distinção entre maiúsculas e minúsculas</strong>: Por padrão, expressões regulares distinguem maiúsculas e minúsculas. Isso pode ser alterado com a opção "i". Por exemplo, o padrão <span class="regex">i)abc</span> procura por “abc” sem importar-se com a altura da caixa. Veja abaixo para outros modificadores.</p> <h2 id="Options">Opções (há distinção entre maiúsculas e minúsculas)</h2> <p>Bem no começo de uma expressão regular, especifique zero ou mais das opções abaixo seguidas de um fecha-parênteses. Por exemplo, o padrão <span class="regex"><span class="red">im)</span>abc</span> buscaria “abc” com as opções case-insensitive (sem distinção de M&amp;m) e multilinha (o parêntese pode ser omitido quando não houver opções). Apesar de essa sintaxe quebrar a tradição, ela não requer nenhum delimitador especial (como uma barra), e portanto não há necessidade de escapar esses delimitadores dentro do padrão. Além disso, o desempenho é melhorado porque as opções são mais fáceis de analisar.</p> <table class="info"> <tr> <th class="center">Opção</th> <th abbr="Descr">Descrição</th> </tr> <tr id="opt_i"> <td class="center bold">i</td> <td>Correspondência insensível à altura da caixa (case insensitive matching), que trata as letras A a Z como idênticas às suas correspondentes minúsculas.</td> </tr> <tr id="Multiline"> <td class="center bold">m</td> <td><p>Multilinha. Enxerga <em>Haystack</em> como uma coleção de linhas individuais (se contiver pelo menos uma quebra de linha) em vez de uma única linha contínua. Especialmente, a opção m altera o seguinte:</p> <p>1) Circunflexo (^) encontra o item que vem imediatamente depois de todas as quebras de linha internas — como também encontra o início do <em>Palheiro</em> ao qual “^” sempre corresponde (mas não encontra uma quebra de linha <em>bem no final</em> do <em>Palheiro</em>). </p> <p>2) Sinal de cifrão ($) encontra aquilo que vem antes de quaisquer quebras de linha no <em>Palheiro</em> (bem como ao final da string, à qual o $ sempre corresponde).</p> <p>Por exemplo, o padrão <span class="regex"><span class="red">m)</span>^abc$</span> corresponde a xyz`r`n<span class="subj">abc</span>. Porém, sem a opção “m”, não haveria correspondência.</p> <p>A opção “D” é ignorada quando “m” está presente.</p></td> </tr> <tr id="opt_s"> <td class="center bold">s</td> <td>“DotAll”. Faz com que o ponto final (.) corresponda a todos os caracteres inclusive quebras de linha (normalmente, o ponto não corresponde a quebras de linha). However, two dots are required to match a CRLF newline sequence (`r`n), not one. Independentemente desta opção, uma classe negada como <span class="regex">[^a]</span> sempre corresponde a quebras de linhas.</td> </tr> <tr id="opt_x"> <td class="center bold">x</td> <td>Ignora espaços em branco no padrão a não ser que estejam escapados ou dentro de uma classe de caracteres (dentro de colchetes). Os caracteres `n e `t estão entre eles porque, no momento em que eles são enxergados pelo engine do PCRE, eles já se tornaram caracteres de espaço em branco literais/brutos (em contraste, \n e \t não são ignorados porque eles são sequências de escape das PCRE. A opção “x” também ignora caracteres entre uma # não escapada fora de uma classe de caracteres e o próximo caractere de quebra de linha, inclusiuve. Isso torna possível a inclusão de comentários dentro de padrões complicados. Contudo, isso só se aplica a caracteres de dados; espaços em branco podem ocorrer dentro de sequências de caracteres especiais como (?(, o qual inicia um subpadrão condicional.</td> </tr> <tr id="opt_A"> <td class="center bold">A</td> <td>Força a ancoragem do padrão; isto é, ele só pode corresponder ao início do <em>Palheiro</em>. Sob a maioria das condições, essa opção é equivalente a ancorar explicitamente o padrão por meios como o “^”.</td> </tr> <tr id="opt_D"> <td class="center bold">D</td> <td>Força a que o cifrão ($) corresponda ao final do <em>Palheiro</em>, mesmo se o último item do <em>Palheiro</em> for uma quebra de linha. Sem essa opção, o $ passa a corresponder ao elemento que vem logo antes da quebra de linha (se houver uma). Observação: Esta opção é ignorada quando a opção “m” está presente.</td> </tr> <tr id="opt_J"> <td class="center bold">J</td> <td>Permite <a href="../lib/RegExMatch.htm#NamedSubPat">subpadrões nomeados</a> duplicados. Pode ser útil para padrões nos quais somente um de uma coleção de subpadrões com nomes idênticos puder ser correspondida. Observação: Se mais de uma instância de um nome em particular corresponder a algo, somente o que estiver mais à esquerda é armazenado. Além disso, nomes de variáveis não distinguem maiúsculas de minúsculas.</td> </tr> <tr id="opt_U"> <td class="center bold">U</td> <td>“Contentável” (NdT: antônimo de “ávido”, adjetivo aqui usado para o vocábulo “greedy”, comum no universo das RegExes.) Torna os quantificadores <strong>*</strong>, <strong>?</strong>, <strong>+</strong> e <strong>{min, máx}</strong> consumir somente aqueles caracteres absolutamente necessários para formar uma correspondência, deixando as remanescentes para a próxima parte do padrão. Quando a opção “U” não está em vigor, um quantificador individual pode ser tornado contentável ao suceder-se-o com um ponto de interrogação. De outro lado, quando “U” <em>está</em> em vigor, o ponto de interrogação torna ávido um quantificador individual.</td> </tr> <tr id="opt_extra"> <td class="center bold">X</td> <td>PCRE_EXTRA. Habilita funcionalidades das XRCP que são incompatíveis com o Perl. No momento, a única funcionalidade é que qualquer contrabarra dentro de um padrão que está seguido por uma letra que não tem significado especial faz com que a correspondência falhe e o ErrorLevel será ajustado de acordo. Esta opção ajuda a reservar sequencias não usadas com contrabarras para uso futuro. Sem essa opção, uma contrabarra seguida de uma letra sem significado especial é tratada literalmente (ex: \g e g são ambas reconhecidas como um g literal). Independentemente dessa opção, sequencias não alfabéticas com contrabarras que não têm significado especial são sempre tratadas como caracteres literais (ex: \/ e / são ambos reconhecidos como a barra “/”).</td> </tr> <tr id="opt_P"> <td class="center bold">P</td> <td>Modo posição. Este modo faz com que RegExMatch() encontre a posição e comprimento da correspondência e seus subpadrões em vez das substrings correspondentes. Para detalhes, vide <a href="../lib/RegExMatch.htm#PosMode">OutputVar</a>.</td> </tr> <tr id="opt_O"> <td class="center bold">O</td> <td>Object mode. <span class="ver">[v1.1.05+]</span>: This causes RegExMatch() to yield all information of the match and its subpatterns to a <a href="../lib/RegExMatch.htm#MatchObject">match object</a> in OutputVar. For details, see <a href="../lib/RegExMatch.htm#ObjectMode">OutputVar</a>.</td> </tr> <tr id="Study"> <td class="center bold">S</td> <td>Studies the pattern to try improve its performance. This is useful when a particular pattern (especially a complex one) will be executed many times. If PCRE finds a way to improve performance, that discovery is stored alongside the pattern in the cache for use by subsequent executions of the same pattern (subsequent uses of that pattern should also specify the S option because finding a match in the cache requires that the option letters exactly match, including their order).</td> </tr> <tr id="opt_Callout"> <td class="center bold">C</td> <td>Enables the auto-callout mode. See <a href="RegExCallout.htm#auto">Regular Expression Callouts</a> for more info.</td> </tr> <tr id="opt_esc_n"> <td class="center bold">`n</td> <td>Switches from the default newline character (`r`n) to a solitary linefeed (`n), which is the standard on UNIX systems. The chosen newline character affects the behavior of <a href="#anchor">anchors (^ and $)</a> and the <a href="#dot">dot/period pattern</a>.</td> </tr> <tr id="opt_esc_r"> <td class="center bold">`r</td> <td>Switches from the default newline character (`r`n) to a solitary carriage return (`r).</td> </tr> <tr id="NEWLINE_ANY"> <td class="center bold">`a</td> <td><span class="ver">[v1.0.46.06+]</span>: `a recognizes any type of newline, namely `r, `n, `r`n, `v/VT/vertical tab/chr(0xB), `f/FF/formfeed/chr(0xC), and NEL/next-line/chr(0x85). <span class="ver">[v1.0.47.05+]</span>: Newlines can be restricted to only CR, LF, and CRLF by instead specifying (*ANYCRLF) in uppercase at the beginning of the pattern (after the options); e.g. <span class="regex">im)(*ANYCRLF)^abc$</span>.</td> </tr> </table> <p class="note"><strong>Observação</strong>: Spaces and tabs may optionally be used to separate each option from the next.</p> <h2 id="Common">Símbolos Comuns e Sintaxe</h2> <table class="info"> <tr> <th class="center">Element</th> <th abbr="Descr">Descrição</th> </tr> <tr id="dot"> <td class="center bold">.</td> <td>By default, a dot matches any single character which is not part of a newline (`r`n) sequence, but this can be changed by using the <a href="#opt_s">DotAll (s)</a>, <a href="#opt_esc_n">linefeed (`n)</a>, <a href= "#opt_esc_r">carriage return (`r)</a>, <a href="#NEWLINE_ANY">`a or (*ANYCRLF)</a> options. For example, <span class="regex">ab.</span> matches <span class="subj">abc</span> and <span class="subj">abz</span> and <span class="subj">ab_</span>.</td> </tr> <tr> <td class="center bold">*</td> <td><p>An asterisk matches zero or more of the preceding character, <a href="#class">class</a>, or <a href="#subpat">subpattern</a>. For example, <span class="regex">a*</span> matches <span class="subj">a</span>b and <span class="subj">aaa</span>b. It also matches at the very beginning of any string that contains no "a" at all.</p> <p><strong>Wildcard:</strong> The dot-star pattern <span class="regex">.*</span> is one of the most permissive because it matches zero or more occurrences of <em>any</em> character (except newline: `r and `n). For example, <span class="regex">abc.*123</span> matches <span class="subj">abcAnything123</span> as well as <span class="subj">abc123</span>.</p> </td> </tr> <tr> <td class="center bold">?</td> <td>A question mark matches zero or one of the preceding character, <a href="#class">class</a>, or <a href="#subpat">subpattern</a>. Think of this as "the preceding item is optional". For example, <span class="regex">colou?r</span> matches both <span class="subj">color</span> and <span class="subj">colour</span> because the "u" is optional.</td> </tr> <tr> <td class="center bold">+</td> <td>A plus sign matches one or more of the preceding character, <a href="#class">class</a>, or <a href="#subpat">subpattern</a>. For example <span class="regex">a+</span> matches <span class="subj">a</span>b and <span class="subj">aaa</span>b. But unlike <span class="regex">a*</span> and <span class="regex">a?</span>, the pattern <span class="regex">a+</span> does not match at the beginning of strings that lack an "a" character.</td> </tr> <tr> <td class="center bold">{min,max}</td> <td><p>Matches between <em>min</em> and <em>max</em> occurrences of the preceding character, <a href="#class">class</a>, or <a href="#subpat">subpattern</a>. For example, <span class="regex">a{1,2}</span> matches <span class="subj">a</span>b but only the first two a's in <span class="subj">aa</span>ab.</p> <p>Also, {3} means exactly 3 occurrences, and {3<strong>,</strong>} means 3 or more occurrences. Observação: The specified numbers must be less than 65536, and the first must be less than or equal to the second.</p></td> </tr> <tr id="class"> <td class="center bold">[...]</td> <td><p><strong>Classes of characters:</strong> The square brackets enclose a list or range of characters (or both). For example, <span class="regex">[abc]</span> means "any single character that is either a, b or c". Using a dash in between creates a range; for example, <span class="regex">[a-z]</span> means "any single character that is between lowercase a and z (inclusive)". Lists and ranges may be combined; for example <span class="regex">[a-zA-Z0-9_]</span> means "any single character that is alphanumeric or underscore".</p> <p>A character class may be followed by <strong>*</strong>, <strong>?</strong>, <strong>+</strong>, or <strong>{min,max}</strong>. For example, <span class="regex">[0-9]+</span> matches one or more occurrence of any digit; thus it matches xyz<span class="subj">123</span> but not abcxyz.</p> <p>The following POSIX named sets are also supported via the form <strong>[[:xxx:]]</strong>, where xxx is one of the following words: alnum, alpha, ascii (0-127), blank (space or tab), cntrl (control character), digit (0-9), xdigit (hex digit), print, graph (print excluding space), punct, lower, upper, space (whitespace), word (same as <a href="#word">\w</a>).</p> <p>Within a character class, characters do not need to be escaped except when they have special meaning inside a class; e.g. <span class="regex">[\^a]</span>, <span class="regex">[a\-b]</span>, <span class="regex">[a\]]</span>, and <span class="regex">[\\a]</span>.</p></td> </tr> <tr> <td class="center bold">[^...]</td> <td>Matches any single character that is <strong>not</strong> in the class. For example, <span class="regex">[^/]*</span> matches zero or more occurrences of any character that is <em>not</em> a forward-slash, such as <span class="subj">http:</span>//. Similarly, <span class="regex">[^0-9xyz]</span> matches any single character that isn't a digit and isn't the letter x, y, or z.</td> </tr> <tr> <td class="center bold">\d</td> <td>Matches any single digit (equivalent to the class <span class="regex">[0-9]</span>). Conversely, capital <strong>\D</strong> means "any <em>non</em>-digit". This and the other two below can also be used inside a <a href="#class">class</a>; for example, <span class="regex">[\d.-]</span> means "any single digit, period, or minus sign".</td> </tr> <tr> <td class="center bold">\s</td> <td>Matches any single whitespace character, mainly space, tab, and newline (`r and `n). Conversely, capital <strong>\S</strong> means "any <em>non</em>-whitespace character".</td> </tr> <tr id="word"> <td class="center bold">\w</td> <td>Matches any single "word" character, namely alphanumeric or underscore. This is equivalent to <span class="regex">[a-zA-Z0-9_]</span>. Conversely, capital <strong>\W</strong> means "any <em>non</em>-word character".</td> </tr> <tr id="anchor"> <td class="center bold">^<br>$</td> <td><p>Circumflex (^) and dollar sign ($) are called <em>anchors</em> because they don't consume any characters; instead, they tie the pattern to the beginning or end of the string being searched.</p> <p><strong>^</strong> may appear at the beginning of a pattern to require the match to occur at the very beginning of a line. For example, <span class="regex">^abc</span> matches <span class="subj">abc</span>123 but not 123abc.</p> <p><strong>$</strong> may appear at the end of a pattern to require the match to occur at the very end of a line. For example, <span class="regex">abc$</span> matches 123<span class="subj">abc</span> but not abc123.</p> <p>The two anchors may be combined. For example, <span class="regex">^abc$</span> matches only <span class="subj">abc</span> (i.e. there must be no other characters before or after it).</p> <p>If the text being searched contains multiple lines, the anchors can be made to apply to each line rather than the text as a whole by means of the <a href="#Multiline">"m" option</a>. For example, <span class="regex">m)^abc$</span> matches 123`r`n<span class="subj">abc</span>`r`n789. Porém, sem a opção “m”, não haveria correspondência.</p></td> </tr> <tr> <td class="center bold">\b</td> <td><strong>\b</strong> means "word boundary", which is like an anchor because it doesn't consume any characters. It requires the current character's <a href="#word">status as a word character (\w)</a> to be the opposite of the previous character's. It is typically used to avoid accidentally matching a word that appears inside some other word. For example, <span class="regex">\bcat\b</span> doesn't match catfish, but it matches <span class="subj">cat</span> regardless of what punctuation and whitespace surrounds it. Capital <strong>\B</strong> is the opposite: it requires that the current character <em>not</em> be at a word boundary.</td> </tr> <tr> <td class="center bold">|</td> <td>The vertical bar separates two or more alternatives. A match occurs if <em>any</em> of the alternatives is satisfied. For example, <span class="regex">gray|grey</span> matches both <span class="subj">gray</span> and <span class="subj">grey</span>. Similarly, the pattern <span class="regex">gr(a|e)y</span> does the same thing with the help of the parentheses described below.</td> </tr> <tr id="subpat"> <td class="center bold">(...)</td> <td><p>Items enclosed in parentheses are most commonly used to:</p> <ul> <li>Determine the order of evaluation. For example, <span class="regex">(Sun|Mon|Tues|Wednes|Thurs|Fri|Satur)day</span> matches the name of any day.</li> <li>Apply <strong>*</strong>, <strong>?</strong>, <strong>+</strong>, or <strong>{min,max}</strong> to a <em>series</em> of characters rather than just one. For example, <span class="regex">(abc)+</span> matches one or more occurrences of the string "abc"; thus it matches <span class="subj">abcabc</span>123 but not ab123 or bc123.</li> <li id="capture">Capture a subpattern such as the dot-star in <span class="regex">abc<span class="red">(.*)</span>xyz</span>. For example, <a href="../lib/RegExMatch.htm">RegExMatch()</a> stores the substring that matches each subpattern in its <a href="../lib/RegExMatch.htm#Array">output array</a>. Similarly, <a href="../lib/RegExReplace.htm">RegExReplace()</a> allows the substring that matches each subpattern to be reinserted into the result via <a href="../lib/RegExReplace.htm#BackRef">backreferences</a> like $1. To use the parentheses without the side-effect of capturing a subpattern, specify <strong>?:</strong> as the first two characters inside the parentheses; for example: <span class="regex">(<span class="red">?:</span>.*)</span></li> <li>Change <a href="#Options">options</a> on-the-fly. For example, <span class="regex">(?im)</span> turns on the case-insensitive and multiline options for the remainder of the pattern (or subpattern if it occurs inside a subpattern). Conversely, <span class="regex">(?-im)</span> would turn them both off. All options are supported except DPS`r`n`a.</li> </ul></td> </tr> <tr> <td class="center bold">\t<br>\r<br>etc.</td> <td><p>These escape sequences stand for special characters. The most common ones are <strong>\t</strong> (tab), <strong>\r</strong> (carriage return), and <strong>\n</strong> (linefeed). In AutoHotkey, an accent (`) may optionally be used in place of the backslash in these cases. Escape sequences in the form <strong>\xhh</strong> are also supported, in which <em>hh</em> is the hex code of any ANSI character between 00 and FF.</p> <p><span class="ver">[v1.0.46.06+]</span>: <strong>\R</strong> means "any single newline of any type", namely those listed at the <a href="#NEWLINE_ANY">`a option</a> (however, \R inside a <a href="#class">character class</a> is merely the letter "R"). <span class="ver">[v1.0.47.05+]</span>: \R can be restricted to CR, LF, and CRLF by specifying (*BSR_ANYCRLF) in uppercase at the beginning of the pattern (after the options); e.g. <span class="regex">im)(*BSR_ANYCRLF)abc\Rxyz</span></p></td> </tr> <tr id="slashP"> <td class="center bold">\p{xx}<br>\P{xx}<br>\X</td> <td><p><span class="ver">[AHK_L 61+]:</span> Unicode character properties. Not supported on ANSI builds. <strong>\p{xx}</strong> matches a character with the xx property while <strong>\P{xx}</strong> matches any character <i>without</i> the xx property. For example, <span class="regex">\pL</span> matches any letter and <span class="regex">\p{Lu}</span> matches any upper-case letter. <strong>\X</strong> matches any number of characters that form an extended Unicode sequence.</p> <p>For a full list of supported property names and other details, search for "\p{xx}" at <a href="http://www.pcre.org/pcre.txt">www.pcre.org/pcre.txt</a>.</p></td> </tr> <tr id="UCP"> <td class="center bold">(*UCP)</td> <td><p><span class="ver">[AHK_L 61+]:</span> For performance, \d, \D, \s, \S, \w, \W, \b and \B recognize only ASCII characters by default, even on Unicode builds. If the pattern begins with <strong>(*UCP)</strong>, Unicode properties will be used to determine which characters match. For example, \w becomes equivalent to <span class="regex">[\p{L}\p{N}_]</span> and \d becomes equivalent to <span class="regex">\p{Nd}</span>.</p> </td> </tr> </table> <p><strong>Greed</strong>: By default, <strong>*</strong>, <strong>?</strong>, <strong>+</strong>, and <strong>{min,max}</strong> are greedy because they consume all characters up through the <em>last</em> possible one that still satisfies the entire pattern. To instead have them stop at the <em>first</em> possible character, follow them with a question mark. For example, the pattern <span class="regex">&lt;.+&gt;</span> (which lacks a question mark) means: "search for a &lt;, followed by one or more of any character, followed by a &gt;". To stop this pattern from matching the <em>entire</em> string <span class="subj"><span class="red"><strong>&lt;</strong></span>em&gt;text&lt;/em<span class="red"><strong>&gt;</strong></span></span>, append a question mark to the plus sign: <span class="regex">&lt;.+<span class="red">?</span>&gt;</span>. This causes the match to stop at the first '&gt;' and thus it matches only the first tag <span class="subj"><span class="red"><strong>&lt;</strong></span>em<span class="red"><strong>&gt;</strong></span></span>.</p> <p><strong>Look-ahead and look-behind assertions</strong>: The groups <span class="regex">(?=...)</span>, <span class="regex">(?!...)</span>, <span class="regex">(?&lt;=...)</span>, and <span class="regex">(?&lt;!...)</span> are called <em>assertions</em> because they demand a condition to be met but don't consume any characters. For example, <span class="regex">abc(?=.*xyz)</span> is a look-ahead assertion that requires the string xyz to exist somewhere to the right of the string abc (if it doesn't, the entire pattern is not considered a match). <span class="regex">(?=...)</span> is called a <em>positive</em> look-ahead because it requires that the specified pattern exist. Conversely, <span class="regex">(?!...)</span> is a <em>negative</em> look-ahead because it requires that the specified pattern <em>not</em> exist. Similarly, <span class="regex">(?&lt;=...)</span> and <span class="regex">(?&lt;!...)</span> are positive and negative look-<em>behinds</em> (respectively) because they look to the <em>left</em> of the current position rather than the right. Look-behinds are more limited than look-aheads because they do not support quantifiers of varying size such as <strong>*</strong>, <strong>?</strong>, and <strong>+</strong>. The escape sequence <strong>\K</strong> is similar to a look-behind assertion because it causes any previously-matched characters to be omitted from the final matched string. For example, <span class="regex">foo\Kbar</span> matches "foobar" but reports that it has matched "bar".</p> <p><strong>Related</strong>: Regular expressions are supported by <a href="../lib/RegExMatch.htm">RegExMatch()</a>, <a href="../lib/RegExReplace.htm">RegExReplace()</a>, and <a href="../lib/SetTitleMatchMode.htm">SetTitleMatchMode</a>.</p> <p class="note"><strong>Final note</strong>: Although this page touches upon most of the commonly-used RegEx features, there are quite a few other features you may want to explore such as conditional subpatterns. The complete PCRE manual is at <a href="http://www.pcre.org/pcre.txt">www.pcre.org/pcre.txt</a></p> </body> </html>