How to Parse CDATA HTML Content of XML Using SimpleXML?
Solution 1:
I answered this once before, but I can no longer find that answer.
If you take a look at the string (simplified/beautified):
<content:encoded><![CDATA[
<p>Lorem Ipsom</p>
<p>
<a href='laura-bertram-trance-gemini-145-1080.jpg'
title='<br>November 2012 calendar from 5.10 The Test<br> <a href="</a>
</p>]]>
</content:encoded>
You can see that you have HTML encoded inside the node-value of the <content:encoded>
element. So first you need to obtain the HTML value, which you already do:
$html = $boo->children('content', true)->encoded;
Then you need to parse the HTML inside $html. PHP offers several libraries for HTML parsing.
If you decide to use the more or less recommended DOMDocument for the job, you only need to get the attribute value of a certain element. The same works with its sister library SimpleXML, which you already use (so this is the more recommended route; see the next section as well).
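For the DOMDocument route, reading an attribute could look roughly like the following. This is a minimal sketch, not the code from the question: the HTML chunk in $htmlString and its file name are made up for illustration.

```php
<?php
// Hypothetical HTML chunk standing in for the CDATA content of the feed.
$htmlString = "<p>Lorem Ipsum</p><p><a href='example.jpg' title='Example'>link</a></p>";

$doc = new DOMDocument();
// HTML fragments often trigger parser warnings; keep them internal.
libxml_use_internal_errors(true);
$doc->loadHTML($htmlString);
libxml_clear_errors();

// Take the first <a> element, if there is one, and read its href attribute.
$anchors = $doc->getElementsByTagName('a');
$href = $anchors->length > 0 ? $anchors->item(0)->getAttribute('href') : null;

echo $href, "\n"; // prints "example.jpg"
```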
In the context of your question, here is a tip:
You're using SimpleXML. DOMDocument is a sister-library, meaning you can interchange between the two so you don't need to learn a full new library.
For example, you can use only the HTML parsing feature of DOMDocument, but then import the result into SimpleXML. This is useful, because SimpleXML does not support HTML parsing. That works via simplexml_import_dom().
A simplified step-by-step example:
// get the HTML string out of the feed:
$htmlString = $boo->children('content', true)->encoded;
// create DOMDocument for HTML parsing:
$htmlParser = new DOMDocument();
// load the HTML:
$htmlParser->loadHTML($htmlString);
// import it into simplexml:
$html = simplexml_import_dom($htmlParser);
Now you can use $html as a new SimpleXMLElement that represents the HTML document. As your HTML chunks did not have any <body> tags, according to the HTML specification, they are put inside the <body> tag. This allows you, for example, to access the href attribute of the first <a> inside the second <p> element in your example:
// access the element you're looking for:
$href = $html->body->p[1]->a['href'];
Here is the full example from above:
// get the HTML string out of the feed:
$htmlString = $boo->children('content', true)->encoded;
// create DOMDocument for HTML parsing:
$htmlParser = new DOMDocument();
// your HTML gives parser warnings, keep them internal:
libxml_use_internal_errors(true);
// load the HTML:
$htmlParser->loadHTML($htmlString);
// import it into simplexml:
$html = simplexml_import_dom($htmlParser);
// access the element you're looking for:
$href = $html->body->p[1]->a['href'];
// output it:
echo $href, "\n";
And what it outputs:
laura-bertram-trance-gemini-145-1080.jpg
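The example above relies on $boo from the question. To try the steps on their own, a self-contained sketch might look like this. The feed below, its namespace URI, and the file name are invented for illustration, and $boo is assumed to be a single feed item:

```php
<?php
// Hypothetical feed with HTML wrapped in CDATA inside <content:encoded>.
$xml = <<<XML
<rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <item>
      <content:encoded><![CDATA[
<p>Lorem Ipsum</p>
<p><a href='example-image.jpg' title='Example title'>Example link</a></p>
]]></content:encoded>
    </item>
  </channel>
</rss>
XML;

$feed = simplexml_load_string($xml);
// assumption: $boo is one <item> of the feed
$boo = $feed->channel->item[0];

// get the HTML string out of the feed (cast so loadHTML() receives a plain string):
$htmlString = (string) $boo->children('content', true)->encoded;

// parse the HTML with DOMDocument, keeping parser warnings internal:
libxml_use_internal_errors(true);
$htmlParser = new DOMDocument();
$htmlParser->loadHTML($htmlString);
libxml_clear_errors();

// import into SimpleXML and read the href of the <a> in the second <p>:
$html = simplexml_import_dom($htmlParser);
$href = (string) $html->body->p[1]->a['href'];

echo $href, "\n"; // prints "example-image.jpg"
```

Casting to string is a choice made here so that the result is a plain string rather than a SimpleXMLElement.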
Solution 2:
You would need to parse the image URL, e.g. via preg_match with this regex: '(http://(?:[^']*))'
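As a rough sketch of how that could be used: the input string below is made up, and the pattern is rewritten with ~ delimiters and an explicit capture group, which is an adaptation rather than the exact expression given above. Note that it only finds absolute http:// URLs that end at a single quote, so a relative href like the one in the question's sample would not match.

```php
<?php
// Hypothetical HTML string with an absolute URL in a single-quoted attribute.
$htmlString = "<p><a href='http://example.com/image.jpg' title='Example'>link</a></p>";

// Capture everything from "http://" up to (not including) the next single quote.
$pattern = "~(http://[^']*)~";

if (preg_match($pattern, $htmlString, $matches)) {
    $url = $matches[1];
} else {
    $url = null; // no absolute http:// URL found
}

echo $url, "\n"; // prints "http://example.com/image.jpg"
```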