Get text between <tag> and </tag> with PREG in PHP

PHP Sep 01, 2020

When building a simple web crawler in PHP, we often need to extract the contents of HTML tags. We can use a third-party library such as PHP Simple HTML DOM Parser, or PHP's XML manipulation extensions. This article describes how to match tags with PCRE patterns instead.
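For comparison, the parser-based route mentioned above might look like the following minimal sketch. It assumes PHP's bundled DOM extension (DOMDocument and DOMXPath) is enabled; the sample markup and the variable names are illustrative only.

```php
<?php
// Sketch: read the text of <span id="name1"> with the DOM extension instead of a regex.
$html = '<div><span id="name1" data="data1">value1</span><span id="name2" data="data2">value2</span></div>';

// Collect libxml parse warnings internally instead of printing them.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);

// Select the span by its id attribute with an XPath query.
$xpath = new DOMXPath($doc);
$nodes = $xpath->query('//span[@id="name1"]');

$text = $nodes->length > 0 ? $nodes->item(0)->textContent : null;
var_dump($text);
```

A parser copes with nested or irregular markup better than a pattern does; the PCRE approach in the rest of this article is best suited to small, predictable snippets.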

Single-line

For example, suppose we have the following HTML:

<div>
    <span id="name1" data="data1">value1</span>
    <span id="name2" data="data2">value2</span>
    <span id="name3" data="data3">value3</span>
</div>

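Using the dot metacharacter (.), we can match the text of the first span tag. By default the dot does not match a newline, so the greedy (.*) cannot run past the end of the line that holds the first span.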

$str = <<<EOF
<div>
    <span id="name1" data="data1">value1</span>
    <span id="name2" data="data2">value2</span>
    <span id="name3" data="data3">value3</span>
</div>
EOF;
// [^>]* skips the remaining attributes; (.*) captures the text up to </span>.
preg_match('/<span id="name1"[^>]*>(.*)<\/span>/', $str, $matches);
print_r($matches);

Output:

Array
(
    [0] => <span id="name1" data="data1">value1</span>
    [1] => value1
)
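The same idea extends to all three spans. The sketch below is an addition, not part of the original example: it swaps preg_match for preg_match_all, captures the id as well as the text, and relies on each span sitting on its own line so that the greedy (.*) stops at the end of that line. The $values array is a name chosen here for illustration.

```php
<?php
$str = <<<EOF
<div>
    <span id="name1" data="data1">value1</span>
    <span id="name2" data="data2">value2</span>
    <span id="name3" data="data3">value3</span>
</div>
EOF;

// Group 1 captures the id, group 2 the text between the tags.
// PREG_SET_ORDER returns one array per matched span.
preg_match_all('/<span id="([^"]*)"[^>]*>(.*)<\/span>/', $str, $matches, PREG_SET_ORDER);

// Build an id => text map from the matches.
$values = [];
foreach ($matches as $m) {
    $values[$m[1]] = $m[2];
}
print_r($values);
```

If two spans shared one line, the greedy (.*) would run from the first opening tag to the last closing tag on that line, so this form only suits markup laid out as above.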

Multi-line

For example, suppose we have the following HTML:

<div>
    <span id="name1" data="data1">
        value1
    </span>
    <span id="name2" data="data2">
        value2
    </span>
    <span id="name3" data="data3">
        value3
    </span>
</div>

Using the s modifier (PCRE_DOTALL) and the ? metacharacter, we can match the text of the first span tag even when it spans several lines.

s (PCRE_DOTALL): If this modifier is set, a dot metacharacter in the pattern matches all characters, including newlines. Without it, newlines are excluded. This modifier is equivalent to Perl's /s modifier. A negative class such as [^a] always matches a newline character, independent of the setting of this modifier.

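? has several roles in PCRE: it extends the meaning of ( (for example in (?:...)), it is the 0-or-1 quantifier, and placed after another quantifier it makes that quantifier lazy. In the pattern used here, .*? matches as few characters as possible, so the match ends at the first </span> rather than the last one.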
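Each of the two ingredients can be checked in isolation. The following is a small sketch on made-up input strings (they are not taken from the example in this article); the variable names are illustrative.

```php
<?php
// 1. Without the s modifier the dot stops at a newline; with it, the dot crosses lines.
$withoutS = preg_match('/a.b/', "a\nb");   // no match
$withS    = preg_match('/a.b/s', "a\nb");  // match

// 2. Greedy vs lazy on a single line that holds two spans.
$line = '<span>one</span><span>two</span>';
preg_match('/<span>(.*)<\/span>/', $line, $greedy);  // runs to the last </span>
preg_match('/<span>(.*?)<\/span>/', $line, $lazy);   // stops at the first </span>

var_dump($withoutS, $withS); // int(0), int(1)
var_dump($greedy[1]);        // string(19) "one</span><span>two"
var_dump($lazy[1]);          // string(3) "one"
```

Once the s modifier lets the dot cross newlines, the lazy form matters: a greedy (.*) would otherwise swallow everything up to the final </span> in the document.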

$str = <<<EOF
<div>
    <span id="name1" data="data1">
        value1
    </span>
    <span id="name2" data="data2">
        value2
    </span>
    <span id="name3" data="data3">
        value3
    </span>
</div>
EOF;
// (.*?) is lazy and stops at the first </span>; the s modifier lets the dot match newlines.
preg_match('/<span id="name1"[^>]*>(.*?)<\/span>/s', $str, $matches);
print_r($matches);

Output:

Array
(
    [0] => <span id="name1" data="data1">
        value1
    </span>
    [1] => 
        value1
    
)
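As the output shows, the captured group keeps the newlines and indentation that surround value1. If only the bare text is wanted, one option is to strip the whitespace afterwards with PHP's built-in trim. This is an addition to the article's example, shown as a sketch on a shortened copy of the markup:

```php
<?php
$str = <<<EOF
<div>
    <span id="name1" data="data1">
        value1
    </span>
</div>
EOF;

$value = null;
if (preg_match('/<span id="name1"[^>]*>(.*?)<\/span>/s', $str, $matches)) {
    // trim() removes the leading and trailing whitespace, including newlines.
    $value = trim($matches[1]);
}
var_dump($value); // string(6) "value1"
```

Checking the return value of preg_match before reading $matches[1] avoids an undefined-index notice when the tag is absent.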
Updated Sep 01, 2020