From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from out.smtp-auth.no-ip.com (smtp-auth.no-ip.com [8.23.224.61]) by hurricane.the-brannons.com (Postfix) with ESMTPS id 18ED57890C for ; Fri, 28 Aug 2015 18:43:20 -0700 (PDT) X-No-IP: carhart.net@noip-smtp X-Report-Spam-To: abuse@no-ip.com Received: from carhart.net (unknown [99.52.200.227]) (Authenticated sender: carhart.net@noip-smtp) by smtp-auth.no-ip.com (Postfix) with ESMTPA id 5A37A400EBD; Fri, 28 Aug 2015 18:45:30 -0700 (PDT) Received: from kevc (carhart.net [192.168.1.179]) by carhart.net (8.13.8/8.13.8) with ESMTP id t7T1jTQx013560; Fri, 28 Aug 2015 18:45:29 -0700 To: chris@the-brannons.com, kevin@carhart.net, Edbrowse-dev@lists.the-brannons.com From: Kevin Carhart Reply-to: Kevin Carhart User-Agent: edbrowse/3.5.4.2 Date: Fri, 28 Aug 2015 18:45:29 -0700 Message-ID: <20150728184529.kevin@carhart.net > Mime-Version: 1.0 Content-Type: multipart/mixed; boundary=nextpart-eb-127385 Content-Transfer-Encoding: 7bit Subject: [Edbrowse-dev] tidy tree X-BeenThere: edbrowse-dev@lists.the-brannons.com X-Mailman-Version: 2.1.20 Precedence: list List-Id: Edbrowse Development List List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 29 Aug 2015 01:43:20 -0000 This message is in MIME format. Since your mail reader does not understand this format, some or all of this message may not be legible. --nextpart-eb-127385 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable Well, that took a while but I have the contents of text nodes now = showing if you are at db6. - I bring in the tidybuffio.h so that I can make a TidyBuffer - I bring in a TidyBuffer because the tidyNodeGetText routine puts its = output in one. - Unlike tidyDoc, it seems as though in order to free a TidyBuffer, you = must test for null. Otherwise the program seg faults, based on a thread I was reading. =20 I have a phrasing at the end of the routine for how to know whether it = is safe to call tidyBufFree. I test the .size. I'm not sure if this is correct- does anyone know? = It wouldn't let me compare the TidyBuffer object with null. - The mechanics of the node traversal, I brought in from the Tidy = example as well as other snippets of work. =20 So I introduced two routines from the sample code: dumpBody and = dumpNode. =20 I placed them just after encodeTags. - Why do they hardcode those first several cases when they are = switching on the node name? I assume it is something to do with the laws of the W3C spec? Like, are these branches that terminate, so you don't have to worry = about additional levels? Anyway, I left them alone. Over to you! I'm sure this will raise some fertile issues in what to = do from here. I hope there will not be \r introduced into this attachment. If there = is, the email client is ruled out as a culprit, and I'll worry about = other causes. thanks Kevin --nextpart-eb-127385 Content-Type: text/plain; name="/home/kevin/public_html/c7/edbrowse/KC_patch20150828.txt" Content-Transfer-Encoding: 7bit diff -Naur 1/edbrowse-master/src/html.c 2/edbrowse-master/src/html.c --- 1/edbrowse-master/src/html.c 2015-08-27 14:18:35.000000000 -0700 +++ 2/edbrowse-master/src/html.c 2015-08-28 17:59:09.092626328 -0700 @@ -5,7 +5,7 @@ #include "eb.h" #include "tidy.h" - +#include "tidybuffio.h" #define handlerPresent(obj, name) (has_property(obj, name) == EJ_PROP_FUNCTION) static TidyDoc tdoc; @@ -1695,6 +1695,10 @@ showTidyMessages = false; tidySetCharEncoding(tdoc, (cons_utf8 ? "utf8" : "latin1")); tidyParseString(tdoc, html); + if (debugLevel >= 5) { + tidyCleanAndRepair(tdoc); + dumpBody(tdoc); + } ns = initString(&ns_l); preamble = initString(&preamble_l); @@ -2641,6 +2645,88 @@ return ns; } /* encodeTags */ +void dumpBody(TidyDoc tdoc) +{ +/* just for debugging - we only reach this routine at db5 or above */ + dumpNode(tidyGetBody(tdoc), 0); +} + +void dumpNode(TidyNode tnod, int indent) +{ +/* just for debugging - we only reach this routine at db5 or above */ + TidyNode child; + TidyBuffer tnv = { 0 }; /* text-node value */ + for (child = tidyGetChild(tnod); child; child = tidyGetNext(child)) { + ctmbstr name; + tidyBufClear(&tnv); + switch (tidyNodeGetType(child)) { + case TidyNode_Root: + name = "Root"; + break; + case TidyNode_DocType: + name = "DOCTYPE"; + break; + case TidyNode_Comment: + name = "Comment"; + break; + case TidyNode_ProcIns: + name = "Processing Instruction"; + break; + case TidyNode_Text: + name = "Text"; + break; + case TidyNode_CDATA: + name = "CDATA"; + break; + case TidyNode_Section: + name = "XML Section"; + break; + case TidyNode_Asp: + name = "ASP"; + break; + case TidyNode_Jste: + name = "JSTE"; + break; + case TidyNode_Php: + name = "PHP"; + break; + case TidyNode_XmlDecl: + name = "XML Declaration"; + break; + case TidyNode_Start: + case TidyNode_End: + case TidyNode_StartEnd: + default: + name = tidyNodeGetName(child); + break; + } + assert(name != NULL); + printf("Node(%d): %s\n", (indent / 4), ((char *)name)); + if (debugLevel >= 6) { +/* the ifs could be combined with && */ + if (strcmp(((char *)name), "Text") == 0) { + tidyNodeGetText(tdoc, child, &tnv); + printf("Text: %s", tnv.bp); +/* no trailing newline because it appears that there already is one */ + } + } + +/* Get the first attribute for all nodes */ + TidyAttr tattr = tidyAttrFirst(child); + while (tattr != NULL) { +/* Print the node and its attribute */ + printf("Attribute: %s = %s\n", tidyAttrName(tattr), + tidyAttrValue(tattr)); +/* Get the next attribute */ + tattr = tidyAttrNext(tattr); + } + dumpNode(child, indent + 4); + } + if (tnv.size > 0) { + tidyBufFree(&tnv); + } +} + void preFormatCheck(int tagno, bool * pretag, bool * slash) { const struct htmlTag *t; --nextpart-eb-127385--