Issue
I use this pattern to tokenize the contents of the $text variable:
preg_match_all( "/'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+/", $text, $matches );
The $text variable contains:
"Hello!! I'm Sajjad Hossain Sagor. It's 2023. w00t :D \ud83e\udd17"
Here \ud83e\udd17 is the escaped Unicode encoding (a UTF-16 surrogate pair) of the emoji 🤗. I want to capture it as one match, but with the above pattern the escape sequence is split into several separate matches.
See the output below:
array (size=24)
0 => string 'Hello' (length=5)
1 => string '!!' (length=2)
2 => string ' I' (length=2)
3 => string ''m' (length=2)
4 => string ' Sajjad' (length=7)
5 => string ' Hossain' (length=8)
6 => string ' Sagor' (length=6)
7 => string '.' (length=1)
8 => string ' It' (length=3)
9 => string ''s' (length=2)
10 => string ' 2023' (length=5)
11 => string '.' (length=1)
12 => string ' w' (length=2)
13 => string '00' (length=2)
14 => string 't' (length=1)
15 => string ' :' (length=2)
16 => string 'D' (length=1)
17 => string ' \' (length=2)
18 => string 'ud' (length=2)
19 => string '83' (length=2)
20 => string 'e' (length=1)
21 => string '\' (length=1)
22 => string 'udd' (length=3)
23 => string '17' (length=2)
How can I change the above pattern to capture these Unicode escapes as a single match? Thanks!
Solution
You might use
(?:\\u[a-f0-9]+)+|'[stdm]|'[rv]e|'ll| ?\p{L}+| ?\p{N}+| ?(?!\\u[a-f0-9]+\b)[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+
See a PHP demo and a regex demo.
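// "\u" without braces is not a PHP string escape, so $text holds the literal characters \ud83e\udd17.
$text = "Hello!! I'm Sajjad Hossain Sagor. It's 2023. w00t :D \ud83e\udd17";
// In a double-quoted string, \\\\u becomes \\u in the regex, which matches a literal backslash followed by "u".
$pattern = "/(?:\\\\u[a-f0-9]+)+|'[stdm]|'[rv]e|'ll| ?\p{L}+| ?\p{N}+| ?(?!\\\\u[a-f0-9]+\b)[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+/";
preg_match_all($pattern, $text, $matches);
var_dump($matches[0]);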
Output
array(19) {
[0]=>
string(5) "Hello"
[1]=>
string(2) "!!"
[2]=>
string(2) " I"
[3]=>
string(2) "'m"
[4]=>
string(7) " Sajjad"
[5]=>
string(8) " Hossain"
[6]=>
string(6) " Sagor"
[7]=>
string(1) "."
[8]=>
string(3) " It"
[9]=>
string(2) "'s"
[10]=>
string(5) " 2023"
[11]=>
string(1) "."
[12]=>
string(2) " w"
[13]=>
string(2) "00"
[14]=>
string(1) "t"
[15]=>
string(2) " :"
[16]=>
string(1) "D"
[17]=>
string(1) " "
[18]=>
string(12) "\ud83e\udd17"
}
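To make the alternation easier to read, the same pattern can be spelled out in extended (x) mode with one commented branch per line. This is only a sketch: the constant TOKEN_PATTERN and the function tokenize() are hypothetical names introduced for illustration, the literal spaces are written as [ ] because x mode ignores bare whitespace, and a nowdoc is used so that backslashes reach the regex engine unchanged.

```php
<?php
// Hypothetical constant name; the branches mirror the pattern from the answer above.
// Nowdoc: no PHP escaping happens, so \\u below is the regex for "a literal backslash, then u".
const TOKEN_PATTERN = <<<'REGEX'
~
    (?:\\u[a-f0-9]+)+                        # consecutive backslash-u escapes, kept as one token
  | '[stdm] | '[rv]e | 'll                   # contractions: 's 't 'd 'm 're 've 'll
  | [ ]?\p{L}+                               # optional leading space, then letters
  | [ ]?\p{N}+                               # optional leading space, then digits
  | [ ]?(?!\\u[a-f0-9]+\b)[^\s\p{L}\p{N}]+   # other symbols, unless an escape starts here
  | \s+(?!\S)                                # whitespace not followed by a non-space character
  | \s+                                      # any remaining whitespace
~x
REGEX;

// Hypothetical helper: returns the list of full matches for $text.
function tokenize(string $text): array
{
    preg_match_all(TOKEN_PATTERN, $text, $matches);
    return $matches[0];
}

// Single quotes keep \ud83e\udd17 as literal characters, as in the question.
$text = 'Hello!! I\'m Sajjad Hossain Sagor. It\'s 2023. w00t :D \ud83e\udd17';
$tokens = tokenize($text);
echo count($tokens), "\n";   // 19
echo end($tokens), "\n";     // \ud83e\udd17
```

The order of the branches matters: the escape branch comes first so it wins whenever a match starts at a backslash, and the negative lookahead keeps the symbols branch from consuming the preceding space together with that backslash. That is why the space before the emoji ends up as its own token.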
Answered By – The fourth bird
Answer Checked By – David Goodson (Easybugfix Volunteer)