[FIXED] PHP Regex Pattern To Match Emoji Unicode

Issue

I use this pattern

preg_match_all( "/'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^ \s\p{L}\p{N}]+|\s+(?!\S)|\s+/", $text, $matches );

to tokenize the contents of the $text variable.

Contents of the $text variable: "Hello!! I'm Sajjad Hossain Sagor. It's 2023. w00t :D \ud83e\udd17"

Here, \ud83e\udd17 is the escaped Unicode (a UTF-16 surrogate pair) for the emoji 🤗. I want to capture it as a single match, but with the pattern above the escape sequence is split into several separate matches.

See output below…

 array (size=24)
  0 => string 'Hello' (length=5)
  1 => string '!!' (length=2)
  2 => string ' I' (length=2)
  3 => string ''m' (length=2)
  4 => string ' Sajjad' (length=7)
  5 => string ' Hossain' (length=8)
  6 => string ' Sagor' (length=6)
  7 => string '.' (length=1)
  8 => string ' It' (length=3)
  9 => string ''s' (length=2)
  10 => string ' 2023' (length=5)
  11 => string '.' (length=1)
  12 => string ' w' (length=2)
  13 => string '00' (length=2)
  14 => string 't' (length=1)
  15 => string ' :' (length=2)
  16 => string 'D' (length=1)
  17 => string ' \' (length=2)
  18 => string 'ud' (length=2)
  19 => string '83' (length=2)
  20 => string 'e' (length=1)
  21 => string '\' (length=1)
  22 => string 'udd' (length=3)
  23 => string '17' (length=2)
 

How can I change the pattern above so that these Unicode escapes are captured as a single match? Thanks!

Solution

You might use

(?:\\u[a-f0-9]+)+|'[stdm]|'[rv]e|'ll| ?\p{L}+| ?\p{N}+| ?(?!\\u[a-f0-9]+\b)[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+
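
As a rough reading aid, the same alternatives can be spread over several lines with the x (extended) modifier and annotated one by one. This is only a sketch: the nowdoc, the [ ] spelling of the literal space (needed in x mode), and the $tokens variable are illustrative choices, not part of the answer itself.

```php
<?php
// Sketch: the pattern from the answer, rewritten in extended (x) mode.
// A nowdoc keeps the backslashes exactly as typed, so no PHP-level doubling.
$pattern = <<<'REGEX'
/
  (?:\\u[a-f0-9]+)+                        # a run of backslash-u hex escapes, kept as one token
| '[stdm]                                  # contractions 's 't 'd 'm
| '[rv]e                                   # contractions 're 've
| 'll                                      # contraction 'll
| [ ]?\p{L}+                               # optional space, then letters
| [ ]?\p{N}+                               # optional space, then digits
| [ ]?(?!\\u[a-f0-9]+\b)[^\s\p{L}\p{N}]+   # other symbols, unless the run would start at an escape
| \s+(?!\S)                                # whitespace not followed by a non-space character
| \s+                                      # any remaining whitespace
/x
REGEX;

$text = "Hello!! I'm Sajjad Hossain Sagor. It's 2023. w00t :D \ud83e\udd17";
preg_match_all($pattern, $text, $matches);
$tokens = $matches[0];

// The escaped emoji should come back as the last, single token.
var_dump($tokens[count($tokens) - 1]);
```

The first branch has to come before the others: alternation is tried left to right, so the escape run is claimed before the letter, digit, or symbol branches get a chance to split it. The negative lookahead in the symbol branch stops it from consuming the backslash that begins an escape.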

See a PHP demo and a regex demo.

$text = "Hello!! I'm Sajjad Hossain Sagor. It's 2023. w00t :D \ud83e\udd17";
$pattern = "/(?:\\\\u[a-f0-9]+)+|'[stdm]|'[rv]e|'ll| ?\p{L}+| ?\p{N}+| ?(?!\\\\u[a-f0-9]+\b)[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+/";
preg_match_all(
    $pattern,
    $text, 
    $matches
);
var_dump($matches[0]);

Output

array(19) {
  [0]=>
  string(5) "Hello"
  [1]=>
  string(2) "!!"
  [2]=>
  string(2) " I"
  [3]=>
  string(2) "'m"
  [4]=>
  string(7) " Sajjad"
  [5]=>
  string(8) " Hossain"
  [6]=>
  string(6) " Sagor"
  [7]=>
  string(1) "."
  [8]=>
  string(3) " It"
  [9]=>
  string(2) "'s"
  [10]=>
  string(5) " 2023"
  [11]=>
  string(1) "."
  [12]=>
  string(2) " w"
  [13]=>
  string(2) "00"
  [14]=>
  string(1) "t"
  [15]=>
  string(2) " :"
  [16]=>
  string(1) "D"
  [17]=>
  string(1) " "
  [18]=>
  string(12) "\ud83e\udd17"
}
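
If the goal is to end up with the actual emoji rather than its escaped form, one option is to decode the matched token afterwards. The sketch below assumes the escapes are JSON-style UTF-16 escapes (as \ud83e\udd17 is) and leans on json_decode() for the decoding; decodeUnicodeEscapes is a hypothetical helper name, not something from the answer.

```php
<?php
// Hypothetical helper: turns a run of JSON-style \uXXXX escapes into the
// UTF-8 character(s) it encodes.
function decodeUnicodeEscapes(string $token): string
{
    // Wrap the token in double quotes so json_decode() parses it as a JSON
    // string; json_decode() combines surrogate pairs such as \ud83e\udd17.
    $decoded = json_decode('"' . $token . '"');

    // Fall back to the raw token if it could not be decoded as a string.
    return is_string($decoded) ? $decoded : $token;
}

$emoji = decodeUnicodeEscapes('\ud83e\udd17');

// Compare against U+1F917 written with PHP's own \u{...} escape.
var_dump($emoji === "\u{1F917}");
```

This is only meant for tokens produced by the first branch of the pattern (backslash-u runs); other tokens can be left as they are.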

Answered By – The fourth bird

Answer Checked By – David Goodson (Easybugfix Volunteer)
