How To Parse Xml With Nokogiri Without Losing Html Entities?
If you look at the output in the AFTER section below, Ruby is removing all the HTML entities. How can I parse XML with Nokogiri without losing HTML entities?
Solution 1:
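Your test file might contain invalid entities. In XML, a bare & (as in M&Ms) is not well-formed and must be written as &amp;. When the parser hits one, it tries to recover and may drop the entities that follow, as the INVALID output below shows.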
nokogiri.rb:
require 'nokogiri'
puts "--- INVALID ---"
invalid_xml = <<-XML
<blog:entryFull>invalid M&Ms</blog:entryFull><!-- invalid M and M's --><blog:entryFull>&lt;p&gt;&lt;iframe src="http://w.soundcloud.com/player/?url=http%3A%2F%2Fapi.soundcloud.com%2Ftracks%2F39858946&amp;amp;show_artwork=true" width="100%" height="166" frameborder="no" scrolling="no"&gt;&lt;/iframe&gt;&lt;/p&gt;</blog:entryFull>
XML
doc = Nokogiri::XML::DocumentFragment.parse(invalid_xml)
puts doc
puts "--- VALID ---"
valid_xml = <<-XML
<blog:entryFull>valid M&amp;Ms</blog:entryFull><!-- valid M and M's --><blog:entryFull>&lt;p&gt;&lt;iframe src="http://w.soundcloud.com/player/?url=http%3A%2F%2Fapi.soundcloud.com%2Ftracks%2F39858946&amp;amp;show_artwork=true" width="100%" height="166" frameborder="no" scrolling="no"&gt;&lt;/iframe&gt;&lt;/p&gt;</blog:entryFull>
XML
doc = Nokogiri::XML::DocumentFragment.parse(valid_xml)
puts doc
result:
$ ruby nokogiri.rb
--- INVALID ---
<blog:entryFull>invalid M</blog:entryFull><!-- invalid M and M's --><blog:entryFull>
piframe src="http://w.soundcloud.com/player/?url=http%3A%2F%2Fapi.soundcloud.com%2Ftracks%2F39858946amp;show_artwork=true" width="100%" height="166" frameborder="no" scrolling="no"/iframe/p</blog:entryFull>
--- VALID ---
<blog:entryFull>valid M&amp;Ms</blog:entryFull><!-- valid M and M's --><blog:entryFull>&lt;p&gt;&lt;iframe src="http://w.soundcloud.com/player/?url=http%3A%2F%2Fapi.soundcloud.com%2Ftracks%2F39858946&amp;amp;show_artwork=true" width="100%" height="166" frameborder="no" scrolling="no"&gt;&lt;/iframe&gt;&lt;/p&gt;</blog:entryFull>
So, either:
- Fix the input XML, or
- Use STRICT ParseOptions, so that malformed input raises an error instead of being silently recovered
strict parsing example:
invalid_xml = <<-XML
<?xml version="1.0" encoding="UTF-8"?><root><blog:entryFull>invalid M&Ms</blog:entryFull><blog:entryFull><p><iframe src="http://w.soundcloud.com/player/?url=http%3A%2F%2Fapi.soundcloud.com%2Ftracks%2F39858946&amp;show_artwork=true" width="100%" height="166" frameborder="no" scrolling="no"></iframe></p></blog:entryFull></root>
XML
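begin
  doc = Nokogiri::XML(invalid_xml) do |config|
    config.strict # strict parsing: raise on malformed XML instead of recovering
  end
  puts doc
rescue Nokogiri::XML::SyntaxError => e
  # Report why the document was rejected
  puts "INVALID XML: #{e.message}"
end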
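For the first option (fixing the input XML), here is a minimal sketch, assuming the only problem is unescaped ampersands. The helper name escape_bare_ampersands is hypothetical and not part of Nokogiri. It escapes any & that does not already begin an entity or character reference and leaves existing entities alone.

```ruby
# Hypothetical helper (not a Nokogiri API): escape stray ampersands
# so that text like "M&Ms" becomes well-formed XML before parsing.
def escape_bare_ampersands(xml)
  # Match "&" unless it already starts a named entity (&amp;),
  # a decimal reference (&#38;) or a hex reference (&#x26;).
  bare_ampersand = /&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#x[0-9A-Fa-f]+);)/
  xml.gsub(bare_ampersand, '&amp;')
end

puts escape_bare_ampersands('<a>M&Ms</a>')
# prints <a>M&amp;Ms</a>
puts escape_bare_ampersands('<a>M&amp;Ms &lt;p&gt;</a>')
# prints <a>M&amp;Ms &lt;p&gt;</a> (unchanged)
```

This is only a heuristic: text such as AT&T; would be mistaken for an entity, and CDATA sections would be altered too, so fixing whatever produces the XML is preferable where that is possible.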
Solution 2:
Qambar, I am unable to recreate your issue. However, I am able to produce your desired output given these files/input:
test.xml
<blog:entryFull> <p><iframe src="https://w.soundcloud.com/player/?url=http%3A%2F%2Fapi.soundcloud.com%2Ftracks%2F39858946&amp;show_artwork=true%22" width="100%" height="166" frameborder="no" scrolling="no"></iframe></p></blog:entryFull>
nokogiri.rb
require 'nokogiri'
# Read the whole file at once; File.read opens and closes it for us
contents = File.read("./test.xml")
puts "--- BEFORE ---"
puts contents
puts "--- AFTER ---"
doc = Nokogiri::XML::DocumentFragment.parse(contents)
puts doc.inner_html
Console
Development/Code » ruby nokogiri.rb
--- BEFORE ---
<blog:entryFull> <p><iframe src="https://w.soundcloud.com/player/?url=http%3A%2F%2Fapi.soundcloud.com%2Ftracks%2F39858946&amp;show_artwork=true%22" width="100%" height="166" frameborder="no" scrolling="no"></iframe></p></blog:entryFull>
--- AFTER ---
<blog:entryFull> <p><iframe src="https://w.soundcloud.com/player/?url=http%3A%2F%2Fapi.soundcloud.com%2Ftracks%2F39858946&amp;show_artwork=true%22" width="100%" height="166" frameborder="no" scrolling="no"></iframe></p></blog:entryFull>
Solution 3:
The workaround I used was to fetch the XML tag's content with a regex, decode its HTML entities with an HTML entities library, and then parse the result with Nokogiri's HTML parser.
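The first two steps of this workaround might look roughly like the sketch below. It is not the answerer's actual code: the helper name embedded_html and the sample URL are made up for illustration, and CGI.unescapeHTML from Ruby's standard library stands in for whichever entity decoder was actually used.

```ruby
require 'cgi'

# Hypothetical helper: return the decoded HTML stored inside the first
# <tag>...</tag> pair, or nil when the tag is not present.
def embedded_html(xml, tag)
  name = Regexp.escape(tag)
  pattern = %r{<#{name}>(.*?)</#{name}>}m
  escaped = xml[pattern, 1]                 # step 1: fetch the tag body with a regex
  escaped && CGI.unescapeHTML(escaped)      # step 2: decode one level of entities
end

xml = '<blog:entryFull>&lt;p&gt;&lt;iframe src="http://example.com/player?a=1&amp;amp;b=2"&gt;&lt;/iframe&gt;&lt;/p&gt;</blog:entryFull>'
puts embedded_html(xml, 'blog:entryFull')
# prints <p><iframe src="http://example.com/player?a=1&amp;b=2"></iframe></p>
```

The returned string is ordinary HTML, so for the last step it can be handed to Nokogiri::HTML::DocumentFragment.parse. Note that a regex only copes with simple, non-nested tags like the one shown here.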