Regular Expressions

Notepad++ regular expressions (“regex”) use the Boost regular expression library v1.85 (as of NPP v8.6.6), which was originally based on PCRE (Perl Compatible Regular Expression) syntax and departs from it only in very minor ways. Complete documentation on the precise implementation can be found on the Boost pages for search syntax and replacement syntax. (Some users have misunderstood this paragraph to mean that they can use one of the regex-explainer websites that accepts PCRE and expect anything that works there to also work in Notepad++; this is not accurate. There are many different “PCRE” implementations, and Boost itself does not claim to be “PCRE”, though both Boost and the PCRE variants have the same origins in an early version of Perl’s regex engine. If your regex-explainer does not claim to use the same Boost engine as Notepad++ uses, there will be differences between the results from your chosen website and the results that Notepad++ gives.)

The Notepad++ Community has a FAQ on other resources for regular expressions.

Note: Regular expression “backward” search is disallowed due to sometimes surprising results. (For example, in the text to the test they travelled, a forward regex t\w+ will find 5 results; the same regex searching backward will find 17 matches.) If you really need this feature, please see Allow regex backward search to learn how to activate this option.

Important Note: Syntax that works in the Find What: box for searching will not always work in the Replace with: box for replacement. There are different syntaxes. The Control Characters and Match by character code syntax work in both; other than that, see the individual sections for Searches vs Substitutions for which syntaxes are valid in which fields.

Regex Special Characters for Searches

In a regular expression (shortened into regex throughout), special characters interpreted are:

Single-character matches

- . or \C ⇒ Matches any character.
  - If you check the box which says . matches newline, or use the (?s) search modifier, then . or \C will match any character, including newline characters (\r or \n). With the option unchecked, or using the (?-s) search modifier, . or \C only match characters within a line, and do not match the newline characters.
  - Any Unicode character within the Basic Multilingual Plane (BMP) (with a codepoint from U+0000 through U+FFFF) will be matched per these rules. Any Unicode character that is beyond the BMP (with a codepoint from U+10000 through U+10FFFF) will be matched as two separate characters instead, since the “surrogate code” uses two characters. (See the Match by Character Code section for more on how surrogate codes work.)
- \X ⇒ Matches a single non-combining character followed by any number (zero or more) of combining characters. You can think of \X as a “. on steroids”: it matches the whole grapheme as a unit, not just the base character itself. This is useful if you have a Unicode encoded text with accents as separate, combining characters. For example, the letter ǭ̳̚, with four combining characters after the o, can be found either with the regex (?-i)o\x{0304}\x{0328}\x{031a}\x{0333} or with the shorter regex \X (the latter, being generic, matches more than just ǭ̳̚, including but not limited to ą̳̄̚ or o alone). If you want to limit the \X in this example to just match a possibly-modified o (so “o followed by 0 or more modifiers”), use a lookahead before the \X: (?=o)\X, which would match o alone or ǭ̳̚, but not ą̳̄̚.
- \$ , \( , \) , \* , \+ , \. , \? , \[ , \] , \\ , \| ⇒ Prefixing a special character with \ to “escape” the character allows you to search for a literal character when the regular expression syntax would otherwise give that character a special meaning as a regex meta-character. The characters $ ( ) * + . ? [ ] \ | all have special meaning to the regex engine in normal circumstances; to get them to match as a literal (or to show up as a literal in the substitution), you will have to prefix them with the \ character. There are also other characters which are special only in certain circumstances (any time a character is used with a non-literal meaning throughout the Regular Expression section of this manual); if you want to match one of those sometimes-special characters as a literal character in those situations, it will also have to be escaped by putting a \ before it. Please note: if you escape a normal character, it will sometimes gain a special meaning; this is why so many of the syntax items listed in this section have a \ before them.

Match by character code

It is possible to match any character using its character code. This allows searching for any character, even if you cannot type it into the Find box, or when the Find box doesn’t seem to match the emoji you want to search for. If you are using an ANSI encoding in your document (that is, using a character set like Windows 1252), you can use any character code with a decimal codepoint from 0 to 255. If you are using Unicode (one of the UTF-8 or UTF-16 encodings), you can actually match any Unicode character. These notations require knowledge of the hexadecimal or octal versions of the character code. (You can find such character code information on most web pages about ASCII, or about your selected character set, and about UTF-8 and UTF-16 representations of Unicode characters.)

- \0ℕℕℕ ⇒ A single byte character whose code in octal is ℕℕℕ, where each ℕ is an octal digit. (That’s the number 0, not the letter o or O.) This notation works for codepoints 0-255 (\0000 - \0377), which covers the full ANSI character set range, or the first 256 Unicode characters.
  For example, \0101 looks for the letter A, as 101 in octal is 65 in decimal, and 65 is the character code for A in ASCII, in most of the character sets, and in Unicode.
- \xℕℕ ⇒ Specify a single character with code ℕℕ, where each ℕ is a hexadecimal digit. What this stands for depends on the text encoding. This notation works for codepoints 0-255 (\x00 - \xFF), which covers the full ANSI character set range, or the first 256 Unicode characters. For instance, \xE9 may match an é or a θ depending on the character set (also known as the “code page”) in an ANSI encoded document.

These next two only work with Unicode encodings (so the various UTF-8 and UTF-16 encodings):

- \x{ℕℕℕℕ} ⇒ Like \xℕℕ, but matches a full 16-bit Unicode character, which is any codepoint from U+0000 to U+FFFF.
- \x{ℕℕℕℕ}\x{ℕℕℕℕ} ⇒ For Unicode characters above U+FFFF, in the range U+10000 to U+10FFFF, you need to break the single 5-digit or 6-digit hex value and encode it into two 4-digit hex codes; these two codes are the “surrogate codes” for the character. For example, to search for the 🚂 STEAM LOCOMOTIVE character at U+1F682, you would search for the surrogate codes \x{D83D}\x{DE82}.
  If you want to know the surrogate codes for a given character, search the internet for “surrogate codes for character” (where character is the fancy Unicode character you need the codes for); the surrogate codes are equivalent to the two-word UTF-16 encoding for those higher characters, so UTF-16 tables will also work for looking this up. Any site or tool that you are likely to be using to find the U+###### for a given Unicode character will probably already give you the surrogate codes or UTF-16 words for the same character; if not, find a tool or site that does.
  You can also compute surrogate codes yourself from the character code, but only if you are comfortable with hexadecimal and binary. Skip the following bullets if you are prone to mathematics-based PTSD.
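For readers who would rather let a script do the arithmetic, the manual surrogate computation described in this section can be sketched in Python. This is only an illustrative sketch and not part of Notepad++; the function names surrogate_pair and surrogate_regex are hypothetical. Subtracting 0x10000 from the codepoint is the same operation as subtracting one from the plane digits.

```python
from typing import Tuple


def surrogate_pair(codepoint: int) -> Tuple[int, int]:
    """Return the (high, low) surrogate codes for a codepoint above U+FFFF."""
    if not 0x10000 <= codepoint <= 0x10FFFF:
        raise ValueError("only U+10000 through U+10FFFF have surrogate codes")
    offset = codepoint - 0x10000      # the 20 bits ppppwwwwxx vvyyyyzzzz
    high = 0xD800 | (offset >> 10)    # binary prefix 110110 + upper 10 bits
    low = 0xDC00 | (offset & 0x3FF)   # binary prefix 110111 + lower 10 bits
    return high, low


def surrogate_regex(codepoint: int) -> str:
    """Format the pair in the two-code search notation described above."""
    high, low = surrogate_pair(codepoint)
    return f"\\x{{{high:04X}}}\\x{{{low:04X}}}"


# U+1F682 is the STEAM LOCOMOTIVE example from the text.
print(surrogate_regex(0x1F682))  # \x{D83D}\x{DE82}
```

The printed string can be pasted into the Find what box of a Unicode-encoded document.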
- Start with your Unicode U+######, calling the hexadecimal digits PPWXYZ.
- The PP digits indicate the plane. Subtract one and convert to the 4 binary bits pppp (so PP=01 becomes 0000, PP=0F becomes 1110, and PP=10 becomes 1111).
- Convert each of the other digits into 4 bits (W as wwww, X as xxvv, Y as yyyy, and Z as zzzz; you will see in a moment why two different characters are used in xxvv).
- Write those 20 bits in sequence: ppppwwwwxxvvyyyyzzzz
- Group into two equal groups: ppppwwwwxx and vvyyyyzzzz (you can see that the X ⇒ xxvv was split between the two groups, hence the notation).
- Before the first group, insert the binary digits 110110 to get 110110ppppwwwwxx, and split into the nibbles 1101 10pp ppww wwxx. Convert those nibbles to hex: it will give you a value from \x{D800} thru \x{DBFF}; this is the High Surrogate code.
- Before the second group, insert the binary digits 110111 to get 110111vvyyyyzzzz, and split into the nibbles 1101 11vv yyyy zzzz. Convert those nibbles to hex: it will give you a value from \x{DC00} thru \x{DFFF}; this is the Low Surrogate code.
- Combine those into the final \x{ℕℕℕℕ}\x{ℕℕℕℕ} for searching.

For more on this, see the Wikipedia article on Unicode Planes, and the discussion in the Notepad++ Community Forum about how to search for non-ASCII characters.

Collating Sequences

- [[._col_.]] ⇒ The character the col “collating sequence” stands for. For instance, in Spanish, ch is a single letter, though it is written using two characters. That letter would be represented as [[.ch.]]. This trick also works with symbolic names of control characters, like [[.BEL.]] for the character of code 0x07. See also the discussion on character ranges.

Control characters

- \a ⇒ The BEL control character 0x07 (alarm).
- \b ⇒ The BS control character 0x08 (backspace). This is only allowed inside a character class definition. Otherwise, this means “a word boundary”.
- \e ⇒ The ESC control character 0x1B.
- \f ⇒ The FF control character 0x0C (form feed).
- \n ⇒ The LF control character 0x0A (line feed). This is the regular end of line under Unix systems.
- \r ⇒ The CR control character 0x0D (carriage return). This is part of the DOS/Windows end of line sequence CR-LF, and was the EOL character on Mac OS 9 and earlier; OS X and later versions use \n.
- \t ⇒ The TAB control character 0x09 (tab, or hard tab, horizontal tab).
- \c☒ ⇒ The control character obtained from character ☒ by stripping all but its 5 lowest order bits. For instance, \cA and \ca both stand for the SOH control character 0x01. You can think of this as “\c means ctrl”, so \cA is the character you would get from hitting Ctrl+A in a terminal. (Note that \c☒ will not work if ☒ is outside of the Basic Multilingual Plane (BMP); that is, it only works if ☒ is in the Unicode character range U+0000 - U+FFFF. The intention of \c☒ is to mnemonically escape the ASCII control characters obtained by typing Ctrl+☒, so it is expected that you will use a simple ASCII alphanumeric for the ☒, like \cA or \ca.)

Special Control escapes

- \R ⇒ Any newline sequence. Specifically, the atomic group (?>\r\n|\n|\x0B|\f|\r|\x85|\x{2028}|\x{2029}). Please note, this sequence might match one or two characters, depending on the text. Because its length is variable-width, it cannot be used in lookbehinds. Because it expands to a parentheses-based group with an alternation sequence, it cannot be used inside a character class. If you accidentally attempt to put it in a character class, it will be interpreted like any other literal-character escape (where \☒ is used to make sure that the next character is literal), meaning that the R will be taken as a literal R, without any special meaning.
  For example, if you try [\t\R], you may be intending to say “match any single character that’s a tab or a newline”, but what you are actually saying is “match the tab or a literal R”. To get what you probably intended, use [\t\v] for “a tab or any vertical spacing character”, or [\t\r\n] for “a tab or carriage return or newline but not any of the weird verticals”.

Ranges or kinds of characters

Character Classes

- [_set_] ⇒ This indicates a set of characters; for example, [abc] means any of the literal characters a, b or c. You can also use ranges by putting a hyphen between characters, for example [a-z] for any character from a to z. You can use a collating sequence in character ranges, like in [[.ch.]-[.ll.]] (these are collating sequences in Spanish). Certain characters require special treatment inside character classes:
  - To use a literal - in a character class: Use it directly as the first or last character in the enclosing class notation, like [-abc] or [abc-]; OR use it “escaped” at any position, like [\-abc] or [a\-bc].
  - To use a literal ] in a character class: Use it directly right after the opening [ of the class notation, like []abc]; OR use it “escaped” at any position, like [\]abc] or [a\]bc].
  - To use a literal [ in a character class: Use it directly like any other character, like [ab[c]; “escaping” is not necessary, but is permissible, like [ab\[c]. This character is not special when used alone inside a class; however, there are cases where it is special in combination with another:
    - If used with a colon in the order [: inside a class, it is the opening sequence for a named class (described below); if you want to include both a [ and a : inside the same character class, do not use them unescaped right next to each other; either change the order, like [:[], or escape one or both, like [\[:] or [[\:] or [\[\:].
    - If used with an equals sign in the order [= inside a class, it is the opening sequence for an equivalence class (described below); if you want to include both a [ and a = inside the same character class, do not use them unescaped right next to each other; either change the order, like [=[], or escape one or both, like [\[=] or [[\=] or [\[\=].
  - To use a literal \ in a character class, it must be doubled (i.e., \\) inside the enclosing class notation, like [ab\\c].
  - To use a literal ^ in a character class: Use it directly as any character but the first, such as [a^b] or [ab^]; OR use it “escaped” at any position, such as [\^ab] or [a\^b] or [ab\^].
- [^_set_] ⇒ The complement of the characters in the set. For example, [^A-Za-z] means any character except an alphabetic character. Care should be taken with a complement list, as regular expressions are always multi-line, and hence [^ABC]* will match until the first A, B or C (or a, b or c if match case is off), including any newline characters. To confine the search to a single line, include the newline characters in the exception list, e.g. [^ABC\r\n].
- [[:_name_:]] or [[:☒:]] ⇒ The whole character class named name. For many, there is also a single-letter “short” class name, ☒. Please note: the [:_name_:] and [:☒:] must be inside a character class [...] to have their special meaning.
| short | full name | description | equivalent character class |
|---|---|---|---|
| | alnum | letters and digits | |
| | alpha | letters | |
| h | blank | spacing which is not a line terminator | [\t\x20\xA0] |
| | cntrl | control characters | [\x00-\x1F\x7F\x81\x8D\x8F\x90\x9D] |
| d | digit | digits | |
| | graph | graphical character, so essentially any character except for control chars, \x7F, \x80 | |
| l | lower | lowercase letters | |
| | print | printable characters | [\s[:graph:]] |
| | punct | punctuation characters | [!"#$%&'()*+,\-./:;<=>?@\[\\\]^_{\|}~] |
| s | space | whitespace (word or line separator) | [\t\n\x0B\f\r\x20\x85\xA0\x{2028}\x{2029}] |
| u | upper | uppercase letters | |
| | unicode | any character with code point above 255 | [\x{0100}-\x{FFFF}] |
| w | word | word characters | [_\d\l\u] |
| | xdigit | hexadecimal digits | [0-9A-Fa-f] |

Note that letters include any unicode letters (ASCII letters, accented letters, and letters from a variety of other writing systems); digits include ASCII numeric digits, and anything else in Unicode that’s classified as a digit (like superscript numbers ¹²³…).

Note that those character class names may be written in upper or lower case without changing the results. So [[:alnum:]] is the same as [[:ALNUM:]] or the mixed-case [[:AlNuM:]].

As stated earlier, the [:_name_:] and [:☒:] (note the single brackets) must be a part of a surrounding character class. However, you may combine them inside one character class, such as [_[:d:]x[:upper:]=], which is a character class that would match any digit, any uppercase letter, the lowercase x, and the literal _ and = characters. These named classes won’t always appear with the double brackets, but they will always be inside of a character class.

If the [:_name_:] or [:☒:] are accidentally not contained inside a surrounding character class, they will lose their special meaning.
For example, [:upper:] is the character class matching :, u, p, e, and r; whereas [[:upper:]] is similar to [A-Z] (plus other unicode uppercase letters).

- [^[:_name_:]] or [^[:☒:]] ⇒ The complement of the character class named name or ☒ (matching anything not in that named class). This uses the same long names, short names, and rules as mentioned in the previous description.

Character classes may not contain parentheses-based groups of any kind, including the special escape \R (which expands to a parentheses-based group when evaluated, even though \R doesn’t look like it contains parentheses).

Character Properties

These properties behave similarly to named character classes, but cannot be contained inside a character class.

- \p☒ or \p{_name_} ⇒ Same as [[:☒:]] or [[:_name_:]], where ☒ stands for one of the short names from the table above, and name stands for one of the full names from above. For instance, \pd and \p{digit} both stand for a digit, just like the escape sequence \d does.
- \P☒ or \P{_name_} ⇒ Same as [^[:☒:]] or [^[:_name_:]] (not belonging to the class name).

Character escape sequences

- \☒ ⇒ Where ☒ is one of d, w, l, u, s, h, v, described below. These single-letter escape sequences are each equivalent to a class from above. The lower-case escape sequence means it matches that class; the upper-case escape sequence means it matches the negative of that class. (Unlike the properties, these can be used both inside or outside of a character class.)
| Description | Escape Sequence | Positive Class | Negative Escape Sequence | Negative Class |
|---|---|---|---|---|
| digits | \d | [[:digit:]] | \D | [^[:digit:]] |
| word chars | \w | [[:word:]] | \W | [^[:word:]] |
| lowercase | \l | [[:lower:]] | \L | [^[:lower:]] |
| uppercase | \u | [[:upper:]] | \U | [^[:upper:]] |
| word/line separators | \s | [[:space:]] | \S | [^[:space:]] |
| horizontal space | \h | [[:blank:]] | \H | [^[:blank:]] |
| vertical space | \v | see below | \V | |

Vertical space: This encompasses all the [[:space:]] characters that aren’t [[:blank:]] characters: the LF, VT, FF, CR, NEL control characters and the LS and PS format characters: 0x000A (line feed), 0x000B (vertical tabulation), 0x000C (form feed), 0x000D (carriage return), 0x0085 (next line), 0x2028 (line separator) and 0x2029 (paragraph separator). There isn’t a named class which matches.

Note: despite its similarity to \v, and even though \R matches certain vertical space characters, \R is not a character-class-equivalent escape sequence (because it evaluates to a parentheses()-based expression, not a class-based expression). So while \d, \l, \s, \u, \w, \h, and \v are all equivalent to a character class and can be included inside another bracket[]-based character class, the \R is not equivalent to a character class, and cannot be included inside a bracketed[] character-class.

Equivalence Classes

- [[=_char_=]] ⇒ All characters that differ from char by case, accent or similar alteration only. For example [[=a=]] matches any of the characters: A, À, Á, Â, Ã, Ä, Å, a, à, á, â, ã, ä and å.

Multiplying operators

- + ⇒ This matches 1 or more instances of the previous character, as many as it can. For example, Sa+m matches Sam, Saam, Saaam, and so on. [aeiou]+ matches consecutive strings of vowels.
- * ⇒ This matches 0 or more instances of the previous character, as many as it can. For example, Sa*m matches Sm, Sam, Saam, and so on.
- ? ⇒ Zero or one of the last character. Thus Sa?m matches Sm and Sam, but not Saam.
- *? ⇒ Zero or more of the previous group, but minimally: the shortest matching string, rather than the longest string as with the “greedy” operator. Thus, m.*?o applied to the text margin-bottom: 0; will match margin-bo, whereas m.*o will match margin-botto.
- +? ⇒ One or more of the previous group, but minimally.
- {ℕ} ⇒ Matches ℕ copies of the element it applies to (where ℕ is any decimal number).
- {ℕ,} ⇒ Matches ℕ or more copies of the element it applies to.
- {ℕ,ℙ} ⇒ Matches ℕ to ℙ copies of the element it applies to, as many as it can (where ℙ ≥ ℕ).
- {ℕ,}? or {ℕ,ℙ}? ⇒ Like the above, but minimally.
- *+ or ?+ or ++ or {ℕ,}+ or {ℕ,ℙ}+ ⇒ These so-called “possessive” variants of greedy repeat marks do not backtrack. This allows failures to be reported much earlier, which can boost performance significantly. But they will eliminate matches that would require backtracking to be found. As an example, see how the matching engine handles the following two regexes:

  When regex “.*” is run against the text “abc”x :
  - `“` matches `“`
  - `.*` matches `abc”x`
  - `”` doesn't match ( End of line ) => Backtracking
  - `.*` matches `abc”`
  - `”` doesn't match letter `x` => Backtracking
  - `.*` matches `abc`
  - `”` matches `”` => 1 overall match `“abc”`

  When regex “.*+”, with a possessive quantifier, is run against the text “abc”x :
  - `“` matches `“`
  - `.*+` matches `abc”x` ( catches all remaining characters )
  - `”` doesn't match ( End of line )

  Notice there is no match at all in this version, because the possessive quantifier prevents backtracking to a possible solution.

Anchors

Anchors match a zero-length position in the line, rather than a particular character.

- ^ ⇒ This matches the start of a line (except when used inside a set, see above).
- $ ⇒ This matches the end of a line.
- \< ⇒ This matches the start of a word using Boost’s definition of words.
- \> ⇒ This matches the end of a word using Boost’s definition of words.
- \b ⇒ Matches either the start or end of a word.
- \B ⇒ Not a word boundary.
  It represents any location between two word characters or between two non-word characters.
- \A or \` ⇒ Matches the start of the file.
- \z or \' ⇒ Matches the end of the file.
- \Z ⇒ Matches like \z with an optional sequence of newlines before it. This is equivalent to (?=\v*\z), which departs from the traditional Perl meaning for this escape.
- \G ⇒ This “Continuation Escape” matches the end of the previous match, or matches the start of the text being matched if no previous match was found. In Find All or Replace All circumstances, this will allow you to anchor your next match at the end of the previous match. If it is the first match of a Find All or Replace All, and any time you use a single Find Next or Replace, the “end of previous match” is defined to be the start of the search area: the beginning of the document, or the current caret position, or the start of the highlighted text. Because of that, if you are using it in an alternation, where you want to say “find any occurrence of something after some prefix, or after a previous match”, you will want to make sure that your prefix includes the start-of-file \A; otherwise the \G portion may accidentally match start-of-file when you don’t want that to occur.

Capture Groups and Backreferences

- (_subset_) ⇒ Numbered Capture Group: Parentheses mark a part of the regular expression, also known as a subset expression or capture group. The string matched by the contents of the parentheses (indicated by subset in this example) can be re-used with a backreference or as part of a replace operation; see Substitutions, below. Groups may be nested.
- (?<name>_subset_) or (?'name'_subset_) ⇒ Named Capture Group: Names the value matched by subset as the group name. Please note that group names are case-sensitive.
- \ℕ, \gℕ, \g{ℕ}, \g<ℕ>, \g'ℕ', \kℕ, \k{ℕ}, \k<ℕ> or \k'ℕ' ⇒ Numbered Backreference: These syntaxes match the ℕth capture group earlier in the same expression.
  (Backreferences are used to refer to the capture group contents only in the search/match expression; see the Substitution Escape Sequences for how to refer to capture groups in substitutions/replacements.) A regex can have multiple subgroups, so \2, \3, etc. can be used to match others (numbers advance left to right with the opening parenthesis of the group). You can have as many capture groups as you need, and are not limited to only 9 groups (though some of the syntax variants can only reference groups 1-9; see the notes below, and use the syntaxes that explicitly allow multi-digit ℕ if you have more than 9 groups).
  - Example: ([Cc][Aa][Ss][Ee]).*\1 would match a line such as Case matches Case but not Case doesn't match cASE.
  - \ℕ ⇒ This form can only have ℕ as digits 1-9, so if you have more than 9 capture groups, you will have to use one of the other numbered backreference notations, listed in the next bullet point.
    - Example: the expression \10 matches the contents of the first capture group \1 followed by the literal character 0, not the contents of the 10th group.
  - \gℕ, \g{ℕ}, \g<ℕ>, \g'ℕ', \kℕ, \k{ℕ}, \k<ℕ> or \k'ℕ' ⇒ These forms can handle any non-zero ℕ.
    - For positive ℕ, it matches the ℕth subgroup, even if ℕ has more than one digit. \g10 matches the contents from the 10th capture group, not the contents from the first capture group followed by the literal 0. If you want to match a literal number after the contents of the ℕth capture group, use one of the forms that has braces, brackets, or quotes, like \g{ℕ} or \k'ℕ' or \k<ℕ>. For example, \g{2}3 matches the contents of the second capture group, followed by a literal 3, whereas \g23 would match the contents of the twenty-third capture group. For clarity, it is highly recommended to always use the braces or brackets form for multi-digit ℕ.
    - For negative ℕ, groups are counted backwards relative to the last group, so that \g{-1} is the last matched group, and \g{-2} is the next-to-last matched group.
    - Please note the difference between absolute and relative backreferences. For instance, an exact four-letter palindrome word can be matched with:
      - the regex (?-i)\b(\w)(\w)\g{2}\g{1}\b, when using absolute (positive) coordinates
      - the regex (?-i)\b(\w)(\w)\g{-1}\g{-2}\b, when using relative (negative) coordinates
- \g{name}, \g<name>, \g'name', \k{name}, \k<name> or \k'name' ⇒ Named Backreference: The string matching the subexpression named name. (As with the Numbered Backreferences above, these Named Backreferences are used to refer to the capture group contents only in the search/match expression; see the Substitution Escape Sequences for how to refer to capture groups in substitutions/replacements.)
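The numbered backreference examples above can be tried outside Notepad++ with a short Python sketch. Treat it as an approximation only: Python's re module is not the Boost engine that Notepad++ uses, so it merely illustrates the general behaviour of \1 and \2. Python's re does not accept the relative \g{-1} form, so only the absolute form is shown, and the (?-i) prefix is left out because Python matches case-sensitively by default. The sample text is hypothetical.

```python
import re

# Numbered backreference: \1 must repeat exactly what group 1 captured.
case_pattern = re.compile(r"([Cc][Aa][Ss][Ee]).*\1")
print(bool(case_pattern.search("Case matches Case")))        # True
print(bool(case_pattern.search("Case doesn't match cASE")))  # False

# Four-letter palindrome with absolute backreferences:
# group 2 is repeated first, then group 1.
palindrome = re.compile(r"\b(\w)(\w)\2\1\b")
sample = "noon deed abba test"  # hypothetical sample text
found = [m.group(0) for m in palindrome.finditer(sample)]
print(found)  # ['noon', 'deed', 'abba']
```

Results in Notepad++ can differ from this sketch, for instance when Match case is unchecked in the Find dialog.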