Document

The main document interface, including an HTML or XML parser.

There are three main ways to create a Document:

If you want to parse something and inspect the tags, you can use the constructor:

// create and parse some HTML in one call
auto document = new Document("<html></html>");

// or some XML
auto document = new Document("<xml></xml>", true, true); // strict mode enabled

// or better yet:
auto document = new XmlDocument("<xml></xml>"); // specialized subclass

If you want to download something and parse it in one call, the fromUrl static function can help:

auto document = Document.fromUrl("http://dlang.org/");

(note that this requires my arsd.characterencodings and arsd.http2 libraries)

And, if you need to inspect things like <%= foo %> tags and comments, you can add them to the dom like this, with the enableAddingSpecialTagsToDom and parseUtf8 or parseGarbage functions:

auto document = new Document();
document.enableAddingSpecialTagsToDom();
document.parseUtf8("<example></example>", true, true); // change the trues to false to switch from xml to html mode

You can also modify things like selfClosedElements and rawSourceElements before calling the parse family of functions to do further advanced tasks.

However you parse it, it will put a few things into special variables.

root contains the root element. prolog contains the instructions before the root (like <!DOCTYPE html>). To keep the original things, you will need to call enableAddingSpecialTagsToDom first, otherwise the library will return generic strings in there. piecesBeforeRoot will have other parsed instructions, if enableAddingSpecialTagsToDom is called. piecesAfterRoot will contain any xml-looking data after the root tag is closed.

Most often though, you will not need to look at any of that data, since Document itself has methods like querySelector, appendChild, and more which will forward to the root Element for you.
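As a rough sketch of that forwarding, the snippet below parses a small page and queries it through the Document directly instead of going through root. The HTML string and the firstParagraphText helper are made up for illustration.

```d
import arsd.dom;

// Hypothetical helper: returns the text of the first <p>, or null if there is none.
string firstParagraphText(string html) {
	auto document = new Document(html);
	// called on the Document, which forwards to document.root
	auto p = document.querySelector("p");
	return p is null ? null : p.innerText;
}

void main() {
	assert(firstParagraphText(`<html><body><p>hello</p></body></html>`) == "hello");
}
```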

class Document : FileResource , DomParent {
string _contentType;
void delegate(DomMutationEvent)[] eventObservers;
}

Constructors

this
this(string data, bool caseSensitive, bool strict)

Creates a document with the given source data. If you want HTML behavior, use caseSensitive and strict set to false. For XML mode, set them to true.

this
this()

Creates an empty document. It has *nothing* in it at all, ready for you to call one of the parse functions on it.

Members

Aliases

body
alias body = mainBody

This returns the <body> element, if there is one. (It is different than JavaScript, where it is called 'body', because body used to be a keyword in D.)

Functions

clear
void clear()

.

createElement
Element createElement(string name)

.

createForm
Form createForm()

.

createFragment
Element createFragment()

.

createTextNode
Element createTextNode(string content)

.

enableAddingSpecialTagsToDom
void enableAddingSpecialTagsToDom()

Adds objects to the dom representing things normally stripped out during the default parse, like comments, <!instructions>, <% code%>, and <? code?> all at once.

findFirst
Element findFirst(bool delegate(Element) doesItMatch)

.

forms
Form[] forms()

.

getData
immutable(ubyte)[] getData()

Implements the FileResource interface; it calls toString.

getElementById
Element getElementById(string id)
getElementsByClassName
Element[] getElementsByClassName(string tag)
getElementsBySelector
deprecated Element[] getElementsBySelector(string selector)
getElementsByTagName
Element[] getElementsByTagName(string tag)

These functions all forward to the root element. See the documentation in the Element class.

getFirstElementByTagName
Element getFirstElementByTagName(string tag)

FIXME: btw, this could just be a lazy range......

getMeta
string getMeta(string name)

This uses a slightly weird rule to find the tag: it matches on [name=] if the given name has no colon and on [property=] if it has a colon.

mainBody
Element mainBody()

This returns the <body> element, if there is one. (It is different than JavaScript, where it is called 'body', because body used to be a keyword in D.)

opIndex
ElementCollection opIndex(string selector)

This is just something I'm toying with. Right now, you use opIndex to put in css selectors. It returns a struct that forwards calls to all elements it holds, and returns itself so you can chain it.

optionSelector
MaybeNullElement!SomeElementType optionSelector(string selector, string file, size_t line)

These functions all forward to the root element. See the documentation in the Element class.

parse
void parse(string rawdata, bool caseSensitive, bool strict, string dataEncoding)

Take XMLish data and try to make the DOM tree out of it.

parseGarbage
void parseGarbage(string data)

Given the kind of garbage you find on the Internet, try to make sense of it. Equivalent to document.parse(data, false, false, null); (Case-insensitive, non-strict, determine character encoding from the data.) NOTE: this makes no attempt at added security, but it will try to recover from anything instead of throwing.
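Here is a minimal sketch of parseGarbage on the same kind of tag soup used in the examples further down. The countParagraphs helper is hypothetical. Since this path determines the character encoding from the data, I am assuming arsd.characterencodings needs to be in the build too.

```d
import arsd.dom;

// Hypothetical helper: counts <p> elements after a forgiving parse.
size_t countParagraphs(string html) {
	auto document = new Document();
	document.parseGarbage(html); // tries to recover instead of throwing
	return document.querySelectorAll("p").length;
}

void main() {
	// unclosed tags and a miscased <P>; both paragraphs should still be found
	assert(countParagraphs(`<html><body><p>hello <P>there`) == 2);
}
```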

parseStrict
void parseStrict(string data, bool pureXmlMode)

Parses well-formed UTF-8, case-sensitive, XML or XHTML. Will throw exceptions on things like unclosed tags.

parseUtf8
void parseUtf8(string data, bool caseSensitive, bool strict)

Parses well-formed UTF-8 in loose mode (by default). Tries to correct tag soup, but does NOT try to correct bad character encodings.

querySelector
Element querySelector(string selector)
querySelectorAll
Element[] querySelectorAll(string selector)
requireElementById
SomeElementType requireElementById(string id, string file, size_t line)
requireSelector
SomeElementType requireSelector(string selector, string file, size_t line)

These functions all forward to the root element. See the documentation in the Element class.
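A short sketch of how the query and require families differ when nothing matches, as I understand the Element documentation: the query functions hand back null or an empty array, while the require functions throw. The hasMatch helper and the sample HTML are made up for illustration.

```d
import arsd.dom;

// Hypothetical helper: true if requireSelector finds a match, false if it throws.
bool hasMatch(Document document, string selector) {
	try {
		auto e = document.requireSelector(selector);
		return e !is null;
	} catch(Exception e) {
		return false;
	}
}

void main() {
	auto document = new Document(`<html><body><div id="main">hi</div></body></html>`);

	// querySelector hands back null when nothing matches, so check before using it
	assert(document.querySelector("#missing") is null);
	assert(document.querySelector("#main").innerText == "hi");

	// querySelectorAll gives an array of every match
	assert(document.querySelectorAll("div").length == 1);

	// requireSelector throws instead of returning null
	assert(hasMatch(document, "#main"));
	assert(!hasMatch(document, "#missing"));
}
```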

setMeta
void setMeta(string name, string value)

Sets a meta tag in the document header. It is kinda hacky to work easily for both Facebook open graph and traditional html meta tags.
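A small sketch of setMeta and getMeta used together, assuming the name/property rule described under getMeta. The pageWithMeta helper and the meta values are made up for illustration.

```d
import arsd.dom;

// Hypothetical helper: builds a page with one traditional and one Open Graph meta tag.
Document pageWithMeta() {
	auto document = new Document(`<html><head></head><body></body></html>`);
	document.setMeta("description", "An example page"); // no colon: uses name=
	document.setMeta("og:title", "Example");            // colon: uses property=
	return document;
}

void main() {
	auto document = pageWithMeta();
	// getMeta applies the same colon rule when looking the tag up again
	assert(document.getMeta("description") == "An example page");
	assert(document.getMeta("og:title") == "Example");
}
```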

setProlog
void setProlog(string d)

Returns or sets the string before the root element. This is, for example, <!DOCTYPE html>\n or similar.

toPrettyString
string toPrettyString(bool insertComments, int indentationLevel, string indentWith)

Writes it out with whitespace for easier eyeball debugging.
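A minimal sketch of calling it with all three arguments spelled out. The prettify helper and the two-space indent are just choices made for this illustration, and the exact layout of the output is not shown since I have not pinned it down.

```d
import arsd.dom;
import std.algorithm.searching : canFind;

// Hypothetical helper: pretty-prints HTML using two spaces per indentation level.
string prettify(string html) {
	auto document = new Document(html);
	// arguments are insertComments, indentationLevel, indentWith
	return document.toPrettyString(false, 0, "  ");
}

void main() {
	import std.stdio : writeln;
	auto pretty = prettify(`<html><body><div><p>hello</p></div></body></html>`);
	writeln(pretty); // meant for reading by eye, not for exact comparisons
	assert(pretty.canFind("hello"));
}
```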

toString
string toString()

Returns the document in string form. Please note that if there is anything in piecesAfterRoot, it is discarded. If you want to add those pieces to the file, loop over them and append them yourself (but remember xml isn't supposed to have anything after the root element).
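The loop mentioned above could look something like this. It is only a sketch; the toStringWithTrailingPieces helper is a made-up name, not part of the library.

```d
import arsd.dom;

// Hypothetical helper: serializes the document, then re-appends anything found after the root.
string toStringWithTrailingPieces(Document document) {
	string result = document.toString();
	foreach(piece; document.piecesAfterRoot)
		result ~= piece.toString();
	return result;
}

void main() {
	auto document = new Document(`<html><body></body></html>`);
	// with nothing after the root, this is the same as plain toString
	assert(toStringWithTrailingPieces(document) == document.toString());
}
```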

Properties

contentType
string contentType [@property setter]

If you're using this for some other kind of XML, you can set the content type here.

contentType
string contentType [@property getter]

Implements the FileResource interface; useful for sending via http automatically.

filename
string filename [@property getter]

Implements the FileResource interface; useful for sending via http automatically.

prolog
string prolog [@property getter]

Returns or sets the string before the root element. This is, for example, <!DOCTYPE html>\n or similar.

title
string title [@property getter]

Gets the <title> element's innerText, if one exists

title
string title [@property setter]

Sets the title of the page, creating a <title> element if needed.
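A quick sketch of the two title properties together. The retitle helper is hypothetical, and the sample page includes a <head> on the assumption that the setter needs somewhere to put a new <title>.

```d
import arsd.dom;

// Hypothetical helper: sets the title and reads it back through the getter.
string retitle(string html, string newTitle) {
	auto document = new Document(html);
	document.title = newTitle; // creates the <title> element if it is missing
	return document.title;     // the <title> element's innerText
}

void main() {
	assert(retitle(`<html><head></head><body></body></html>`, "My Page") == "My Page");
}
```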

Static functions

fromUrl
Document fromUrl(string url, bool strictMode)

Convenience method for web scraping. Requires arsd.http2 to be included in the build as well as arsd.characterencodings.

Variables

inlineElements
immutable(string)[] inlineElements;

List of elements that are considered inline for pretty printing. The default for a Document is hard-coded to something appropriate for HTML. For XmlDocument, it defaults to empty. You can modify this after construction but before parsing.

loose
bool loose;

.

parseSawAspCode
bool delegate(string) parseSawAspCode;

If the parser sees <% asp code... %>, it will call this callback. It will be passed "% asp code... %" or "%= asp code .. %". Return true if you want the node appended to the document. It will be in an AspCode object.

parseSawBangInstruction
bool delegate(string) parseSawBangInstruction;

If it sees a <! that is not CDATA or comment (CDATA is handled automatically and comments call parseSawComment), it calls this function with the contents. <!SOMETHING foo> calls parseSawBangInstruction("SOMETHING foo"). Return true if you want the node appended to the document. It will be in a BangInstruction object.

parseSawComment
bool delegate(string) parseSawComment;

If the parser sees an HTML comment, it will call this callback. <!-- comment --> will call parseSawComment(" comment "). Return true if you want the node appended to the document. It will be in an HtmlComment object.
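As a sketch of how one of these callbacks might be wired up, the snippet below collects comment text while parsing without adding anything to the DOM. The collectComments helper is hypothetical. The callback is set on an empty Document before any parse function is called.

```d
import arsd.dom;

// Hypothetical helper: collects the text of every comment the parser reports.
string[] collectComments(string html) {
	string[] comments;
	auto document = new Document();
	document.parseSawComment = (string content) {
		comments ~= content; // the text between <!-- and -->
		return false;        // false: do not append a node to the document
	};
	document.parseUtf8(html, false, false); // loose html mode
	return comments;
}

void main() {
	auto found = collectComments(`<html><body><!-- note --><p>hi</p></body></html>`);
	assert(found == [" note "]);
}
```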

parseSawPhpCode
bool delegate(string) parseSawPhpCode;

If the parser sees <?php php code... ?>, it will call this callback. It will be passed "?php php code... ?" or "?= php code .. ?". Note: dom.d cannot identify the other php <? code ?> short format. Return true if you want the node appended to the document. It will be in a PhpCode object.

parseSawQuestionInstruction
bool delegate(string) parseSawQuestionInstruction;

If it sees a <?xxx> that is not php or asp, it calls this function with the contents. <?SOMETHING foo> calls parseSawQuestionInstruction("?SOMETHING foo"). Unlike the php/asp ones, this ends on the first > it sees, without requiring ?>. Return true if you want the node appended to the document. It will be in a QuestionInstruction object.

piecesAfterRoot
Element[] piecesAfterRoot;

Stuff after the root. It is only stored in non-strict mode and is not used in toString, but it is available in case you want it.

piecesBeforeRoot
Element[] piecesBeforeRoot;

if these were kept, this is stuff that appeared before the root element, such as <?xml version ?> decls and <!DOCTYPE>s

rawSourceElements
immutable(string)[] rawSourceElements;

List of elements that contain raw CDATA content for this document, e.g. <script> and <style> for HTML. The parser will read until the closing string and put everything else in a RawSource object for future processing, not trying to do any further child nodes or attributes, etc.

root
Element root;

The root element, like <html>. Most of the methods on Document forward to this object.

selfClosedElements
immutable(string)[] selfClosedElements;

List of elements that can be assumed to be self-closed in this document. The default for a Document is a hard-coded list of ones appropriate for HTML. For XmlDocument, it defaults to empty. You can modify this after construction but before parsing.

Inherited Members

From FileResource

contentType
string contentType [@property getter]

the content-type of the file. e.g. "text/html; charset=utf-8" or "image/png"

getData
immutable(ubyte)[] getData()

the data

filename
string filename [@property getter]

filename, return null if none

Examples

Basic parsing of HTML tag soup

If you simply make a new Document("some string") or use Document.fromUrl to automatically download a page (that function is shorthand for new Document(arsd.http2.get(your_given_url).contentText)), the Document parser will assume it is broken HTML. It will try to fix up things like charset messes, missing closing tags, flipped tags, inconsistent letter cases, and other forms of commonly found HTML on the web.

It isn't exactly the same as what an HTML5 web browser does in all cases, but it usually is, and where it disagrees, it is still usually good enough (but sometimes a bug).

auto document = new Document(`<html><body><p>hello <P>there`);
// this will automatically try to normalize the html and fix up broken tags, etc
// so notice how it added the missing closing tags here and made them all lower case
assert(document.toString() == "<!DOCTYPE html>\n<html><body><p>hello </p><p>there</p></body></html>", document.toString());

Stricter parsing of HTML

When you are writing the HTML yourself, you can remove most ambiguity by making it throw exceptions instead of trying to automatically fix up things basic parsing tries to do. Using strict mode accomplishes this.

This will help guarantee that you have well-formed HTML, which means it is going to parse a lot more reliably by all users - browsers, dom.d, other libraries, all behave better with well-formed input... people too!

(note it is not a full *validator*, just a well-formedness checker. Full validation is a lot more work for very little benefit in my experience, so I stopped here.)

try {
	auto document = new Document(`<html><body><p>hello <P>there`, true, true); // passes true for both case-sensitive and strict mode to the ctor
	assert(0); // never reached, the constructor will throw because strict mode is turned on
} catch(Exception e) {

}

// you can also create the object first, then use the [parseStrict] method
auto document = new Document;
document.parseStrict(`<foo></foo>`); // this is invalid html - no such foo tag - but it is well-formed, since it is opened and closed properly, so it passes

Custom HTML extensions

dom.d is a custom HTML parser, which means you can add custom HTML extensions to it too. It normally reads and discards things like ASP style <% ... %> code as well as XML processing instruction / PHP style embeds <? ... ?> but you can keep this data if you call a function to opt into it before parsing.

Additionally, you can add special tags to be read like <script> to preserve its insides for future processing via the .innerRawSource member.

auto document = new Document; // construct an empty thing first
document.enableAddingSpecialTagsToDom(); // add the special tags like <% ... %> etc
document.rawSourceElements ~= "embedded-plaintext"; // tell it we want a custom raw source element

document.parseStrict(`<html>
	<% some asp code %>
	<script>embedded && javascript</script>
	<embedded-plaintext>my <custom> plaintext & stuff</embedded-plaintext>
</html>`);

// please note that if we did `document.toString()` right now, the original source - almost your same
// string you passed to parseStrict - would be spit back out. Meaning the embedded-plaintext still has its
// special text inside it. Another parser won't understand how to use this! So if you want to pass this
// document somewhere else, you need to do some transformations.
//
// This differs from cases like CDATA sections, which dom.d will automatically convert into plain html entities
// on the output that can be read by anyone.

assert(document.root.tagName == "html"); // the root element is normal

int foundCount;
// now let's loop through the whole tree
foreach(element; document.root.tree) {
	// the asp thing will be in an AspCode object
	if(auto asp = cast(AspCode) element) {
		// you use the `asp.source` member to get the code for these
		assert(asp.source == "% some asp code %");
		foundCount++;
	} else if(element.tagName == "script") {
		// and for raw source elements - script, style, or the ones you add,
		// you use the innerHTML method to get the code inside
		assert(element.innerHTML == "embedded && javascript");
		foundCount++;
	} else if(element.tagName == "embedded-plaintext") {
		// and innerHTML again
		assert(element.innerHTML == "my <custom> plaintext & stuff");
		foundCount++;
	}

}

assert(foundCount == 3);

// writeln(document.toString());

Demoing CDATA, entities, and non-ascii characters.

The previous example mentioned CDATA, let's show you what that does too. These are all read in as plain strings accessible in the DOM - there is no CDATA, no entities once you get inside the object model - but when you convert back into a string, it will normalize them in a particular way.

This is not exactly standards compliant completely in and out thanks to it doing some transformations... but I find it more useful - it reads the data in consistently and writes it out consistently, both in ways that work well for interop. Take a look:

auto document = new Document(`<html>
	<p>¤ is a non-ascii character. It will be converted to a numbered entity in string output.</p>
	<p>&curren; is the same thing, but as a named entity. It also will be changed to a numbered entity in string output.</p>
	<p><![CDATA[xml cdata segments, which can contain <tag> looking things, are converted to encode the embedded special-to-xml characters to entities too.]]></p>
</html>`, true, true); // strict mode turned on

// Inside the object model, things are simplified to D strings.
auto paragraphs = document.querySelectorAll("p");
// no surprise on the first paragraph, we wrote it with the character, and it is still there in the D string
assert(paragraphs[0].textContent == "¤ is a non-ascii character. It will be converted to a numbered entity in string output.");
// but note on the second paragraph, the entity has been converted to the appropriate *character* in the object
assert(paragraphs[1].textContent == "¤ is the same thing, but as a named entity. It also will be changed to a numbered entity in string output.");
// and the CDATA bit is completely gone from the DOM; it just read it in as a text node. The text content shows the text as a plain string:
assert(paragraphs[2].textContent == "xml cdata segments, which can contain <tag> looking things, are converted to encode the embedded special-to-xml characters to entities too.");
// and the dom node beneath it is just a single text node; no trace of the original CDATA detail is left after parsing.
assert(paragraphs[2].childNodes.length == 1 && paragraphs[2].childNodes[0].nodeType == NodeType.Text);

// And now, in the output string, we can see they are normalized thusly:
assert(document.toString() == "<!DOCTYPE html>\n<html>
	<p>&#164; is a non-ascii character. It will be converted to a numbered entity in string output.</p>
	<p>&#164; is the same thing, but as a named entity. It also will be changed to a numbered entity in string output.</p>
	<p>xml cdata segments, which can contain &lt;tag&gt; looking things, are converted to encode the embedded special-to-xml characters to entities too.</p>
</html>");

Streaming parsing

dom.d normally takes a big string and returns a big DOM object tree - hence its name. This is usually the simplest code to read and write, so I prefer to stick to that, but if you wanna jump through a few hoops, you can still make dom.d work with streams.

It is awkward - again, dom.d's whole design is based on building the dom tree, but you can do it if you're willing to subclass a little and trust the garbage collector. Here's how.

bool encountered;
class StreamDocument : Document {
	// the normal behavior for this function is to `parent.appendChild(child)`
	// but we can override to read it as it is processed and not append it
	override void processNodeWhileParsing(Element parent, Element child) {
		if(child.tagName == "bar")
			encountered = true;
		// note that each element's object is created but then discarded as garbage.
		// the GC will take care of it, even with a large document, whereas the normal
		// object tree could become quite large.
	}

	this() {
		super("<foo><bar></bar></foo>");
	}
}

auto test = new StreamDocument();
assert(encountered); // it should have been seen
assert(test.querySelector("bar") is null); // but not appended to the dom node, since we didn't append it

Basic parsing of XML.

dom.d is not technically a standards-compliant xml parser and doesn't implement all xml features, but its stricter parse options together with turning off HTML's special tag handling (e.g. treating <script> and <style> the same as any other tag) gets close enough to work fine for a great many use cases.

For more information, see XmlDocument.

auto xml = new XmlDocument(`<my-stuff>hello</my-stuff>`);
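Once parsed, the usual Document and Element methods apply to the XML tree as well. The following is only a sketch with made-up element names, and itemText is a hypothetical helper.

```d
import arsd.dom;

// Hypothetical helper: returns the text of the first <item>, or null if there is none.
string itemText(string source) {
	auto xml = new XmlDocument(source);
	auto item = xml.querySelector("item");
	return item is null ? null : item.innerText;
}

void main() {
	auto source = `<my-stuff><item id="1">hello</item></my-stuff>`;
	auto xml = new XmlDocument(source);
	assert(xml.root.tagName == "my-stuff");
	assert(xml.querySelector("item").getAttribute("id") == "1");
	assert(itemText(source) == "hello");
}
```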

See Also

Meta