HTM2ASC Documentation 

By Scott M. Baker (smbaker@primenet.com) 

-------------------------------------------------------------------------------

I'm not big on documentation, so don't expect to find a whole lot here. HTM2ASC 
is designed to perform a relatively straightforward task, so there's not a 
whole lot to say!

-------------------------------------------------------------------------------

Purpose: To convert an HTML document to plain ASCII text while preserving as 
much of the original formatting as possible. 

Syntax: HTM2ASC <filename.htm> 

Optional Parameters:

Switch  Parameter
------  ---------------------------------------------------------------------
-lxxx   specify initial left margin (default is 0)
-rxxx   specify initial right margin (default is 79)
-t      dump HTML tree to stdout
-h      disable high ASCII use (i.e. tables)
-c      display HTML tag count statistics to stdout
-x      do not write output file
-u      dump all HTML tags found in anchors to stdout
-uf     dump only full HTML tags found in anchors to stdout (full tags start
        with http://)
-w      (Windows executable only) close the main window after the program
        terminates
-d      disable horizontal lines between table rows


Distribution: The distribution archive will be named SBH2Axxx.ZIP, where xxx is
the current version number. The following files should be present in the
archive:

htm2asc.exe   DOS executable file
htm2ascw.exe  Windows (3.1) executable file
htm2asc.htm   Documentation in HTML format
htm2asc.txt   Documentation in plain ASCII format


-------------------------------------------------------------------------------

Some comments.... 

I wrote HTM2ASC because I wanted to shift my documentation efforts to HTML,
due to its benefits over ASCII text. However, there was still a need to
support users who did not have HTML support available. Rather than maintain
two documents at the same time, I decided to do all my writing in HTML and
then use a converter program of some sort to convert it to ASCII.

I was unable to find an existing utility that adequately handled the job, so I
took on the project of writing my own. My understanding of HTML is still very
basic, but at this point, HTM2ASC handles my documentation reasonably well.

I have the capability of producing both 16-bit DOS DPMI and 16-bit OS/2 
character mode executables if needed. These would have the benefit of larger 
available memory and probably support larger HTML documents. If there is 
demand, I can do it... 

-------------------------------------------------------------------------------

The Windows Version... 

The Windows version uses Borland Pascal's WinCRT unit, which is basically a
simple emulation of the DOS interface. The only real advantages of the Windows
version are additional memory availability and perhaps better cooperation with
other applications.

By default, the Windows version will leave an "inactive" window onscreen with
the program results displayed in it. If you want the window to go away
automatically, use the -w switch. This should be good for batch operations.

-------------------------------------------------------------------------------

How HTM2ASC deals with some common HTML constructs: 

    * Lists. Bulleted and ordered lists (UL and OL) are easy to handle. I just
      indent each element by six spaces and place the appropriate bullet in
      that six-space area. Definition lists are similar: each <DT> is placed
      at the current left margin and each <DD> is indented 4 spaces.
    * Tables. I have put forth a reasonable effort to handle basic tables.
      Tables are by far the most complex HTML element that I've had to deal
      with and, as such, are the most likely not to work correctly.
    * Anchors. Anchors have no use whatsoever in an ASCII document and are 
      therefore ignored completely. 
    * Type Styles: Bold (B), Italics (I), etc. Since there is no way to render 
      these in ASCII, they are ignored. 
    * Headings: <H1>, <H2>, etc. There's no way to do these in ASCII, so the 
      formatting is ignored. Each heading will always start on an empty line. 
    * Horizontal Rule: <HR>. This is emitted as a series of dashes (-) from
      the left margin to the right margin.
    * Images. Images are completely ignored.
    * Preformatted text <PRE>. Any newlines (ASCII #10) are translated as line
      breaks. Multiple spaces are preserved. <P>'s are ignored.
    * Blockquote <BLOCKQUOTE>. The left and right margins are each indented 4
      spaces.
    * Escaped characters:

      Escape Sequence  Translation
      ---------------  -----------
      &lt;             <
      &gt;             >
      &amp;            &
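
A few of the rules above can be sketched in Python. This is only an
illustration, not the HTM2ASC source (which is not shown in this document).
The function names, the exact bullet position inside the six-space area, and
the treatment of the right margin as a maximum line width are all assumptions.

```python
import textwrap

# Only the three escape sequences listed in the table above.
ESCAPES = {"&lt;": "<", "&gt;": ">", "&amp;": "&"}


def unescape(text):
    # Translate &amp; last so that "&amp;lt;" becomes "&lt;", not "<".
    for seq in ("&lt;", "&gt;", "&amp;"):
        text = text.replace(seq, ESCAPES[seq])
    return text


def horizontal_rule(left=0, right=79):
    # Dashes from the left margin up to the right margin.
    # Assumption: the right margin is used as the maximum line width.
    return " " * left + "-" * (right - left)


def list_item(text, bullet="*", left=0, right=79):
    # The bullet sits inside a six-space area; wrapped lines are
    # indented by the same six spaces. Bullet position is assumed.
    first = " " * left + bullet.rjust(5) + " "
    rest = " " * (left + 6)
    return textwrap.fill(text, width=right,
                         initial_indent=first, subsequent_indent=rest)


def blockquote(text, left=0, right=79):
    # Both margins move in by 4 spaces for the quoted text.
    left, right = left + 4, right - 4
    indent = " " * left
    return textwrap.fill(text, width=right,
                         initial_indent=indent, subsequent_indent=indent)
```

For example, list_item("Lists.") gives a line with the bullet in column 5 and
the text starting in column 7, which is the layout used by the list above.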

-------------------------------------------------------------------------------

Some implementation details: 

HTM2ASC works by reading the entire HTML file into memory and creating a tree
structure to hold the data. This has the side effect of consuming a large
amount of memory, and may cause trouble with extremely large HTML documents.
None of my own documents are large enough to present a problem yet, so I'm not
worrying about this for now.

Each HTML tag will be a separate node in this tree. Each bundle of text will
also be a separate node. When text is read in, duplicate spaces are eliminated
and carriage returns and line feeds are completely ignored.
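
As a rough illustration of that tree, here is a minimal Python sketch. It is
not the actual HTM2ASC code. The Node class, the parse function, and the set
of tags that never hold children are hypothetical, and the sketch makes one
loud assumption: carriage returns and line feeds are folded into a single
space rather than deleted, so that words on adjacent lines do not run
together.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Node:
    kind: str                      # "tag" or "text"
    value: str                     # tag name (upper case) or collapsed text
    children: list = field(default_factory=list)


# Assumed for this sketch: tags that never have a closing tag.
VOID_TAGS = {"BR", "HR", "IMG"}

# Either a tag (optionally closing) or a run of text up to the next "<".
TOKEN = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>|([^<]+)")


def parse(html):
    root = Node("tag", "ROOT")
    stack = [root]                 # path from the root to the open tag
    for closing, name, text in TOKEN.findall(html):
        if text:
            # Collapse runs of spaces, tabs, CRs and LFs to one space.
            collapsed = re.sub(r"[ \t\r\n]+", " ", text)
            # Whitespace-only runs are dropped here for brevity.
            if collapsed.strip():
                stack[-1].children.append(Node("text", collapsed))
        elif closing:
            # Pop back to the matching open tag, if there is one.
            for i in range(len(stack) - 1, 0, -1):
                if stack[i].value == name.upper():
                    del stack[i:]
                    break
        else:
            node = Node("tag", name.upper())
            stack[-1].children.append(node)
            if node.value not in VOID_TAGS:
                stack.append(node)
    return root
```

Parsing "<UL><LI>one  two</LI></UL>" with this sketch gives a UL node whose
only child is an LI node holding one text node, "one two".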

The largest contiguous text block (i.e. a text block with no HTML commands in 
it) that HTM2ASC can handle is 64k. This could present a problem in large 
documents. 

When the tree is processed and written back out, a virtual page is used to hold 
the text as it is being formatted. This was the only way I could figure out how 
to handle tables. The virtual page is represented as a circular array/queue. 
When it fills up, old lines are dumped out to disk. Sometimes the page will be 
backtracked, as in the case of tables where table cells span multiple text 
lines. 
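
The virtual page idea can be sketched as follows. This is a small Python
illustration under assumed names (VirtualPage, put, close, demo) and an
assumed tiny capacity, not the real implementation. It shows a circular array
of lines where writing past the end flushes the oldest line, and where a line
still in the buffer can be revisited, as when filling in a second table cell.

```python
import io


class VirtualPage:
    """Fixed-capacity window of text lines kept in a circular array."""

    def __init__(self, out, capacity=4):
        self.out = out
        self.lines = [""] * capacity
        self.first = 0             # line number of the oldest buffered line
        self.count = 0             # number of lines currently buffered

    def _slot(self, lineno):
        return lineno % len(self.lines)

    def _flush_oldest(self):
        # Write the oldest buffered line out and advance the window.
        self.out.write(self.lines[self._slot(self.first)].rstrip() + "\n")
        self.first += 1
        self.count -= 1

    def put(self, lineno, col, text):
        """Place text at (lineno, col); lineno may be an earlier line."""
        if lineno < self.first:
            raise ValueError("line has already been written out")
        # Grow the window, flushing the oldest line whenever it is full.
        while lineno >= self.first + self.count:
            if self.count == len(self.lines):
                self._flush_oldest()
            self.lines[self._slot(self.first + self.count)] = ""
            self.count += 1
        slot = self._slot(lineno)
        line = self.lines[slot].ljust(col)
        self.lines[slot] = line[:col] + text + line[col + len(text):]

    def close(self):
        while self.count:
            self._flush_oldest()


def demo():
    out = io.StringIO()
    page = VirtualPage(out, capacity=2)
    page.put(0, 0, "a")
    page.put(1, 0, "b")
    page.put(2, 0, "c")            # page is full, so line 0 is flushed
    page.put(1, 4, "x")            # backtrack to line 1, still buffered
    page.close()
    return out.getvalue()
```

In demo(), line 0 is written out as soon as line 2 is needed, while line 1 is
still available for the later write at column 4.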

-------------------------------------------------------------------------------

Revision History 

Version 1.01: 

    * Added -c (HTML tag count), -x (no output) switches 
    * Support for <BLOCKQUOTE> 
    * Better handling of blank lines between paragraphs, etc. 

Version 1.02: 

    * Windows version (HTM2ASCW.EXE) 
    * Added -w to make the main window go away in the Windows version
    * Support for <CENTER> and paragraph alignment tags in <P> 
    * Added horizontal lines between table rows 
    * Added -d option to disable horizontal lines between table rows
    * Fixed problem with <TH> nodes 
    * Rewrote wordwrap to better deal with leading and trailing spaces around 
      character formatting tags. 

-------------------------------------------------------------------------------

Is this program freeware? Shareware? or what? 

If there is enough positive response, I will probably make it into a shareware
program. In the meantime, its primary purpose is to handle my own
documentation, and all of my development efforts will be centered on making it
work for my own needs, rather than what other people want.

-------------------------------------------------------------------------------

How to contact me: 

US-Mail: 

Scott M. Baker
2241 W Labriego
Tucson, Az 85741

My Bulletin board: 

The Not-Yet-Named BBS
(520) 544-4655 (USR Dual 14.4k)
(520) 797-8573 (USR Sportster 28.8k)

Email: 

smbaker@primenet.com

My Homepage: 

http://www.primenet.com/~smbaker

