Since I'm still on free hosting, I couldn't send you a direct link.
This is a Perl comic downloader, or cmcdwnlr for short.

Note: This project may have been discontinued. I suggest you try out WebComic Reader, a Greasemonkey script.

Behold! The download links:
download (bz2)
download (zip)
Last updated:
The Readme follows:

'cmcdwnlr' is a tool that parses comic sites and
downloads new comics.
It is released under the GNU GPL:
    http://www.gnu.org/licenses/gpl-3.0.txt


TOC:
> Usage
> Goals
> Dependencies
> Modes
> Configuring
> Conf file
> 'url' expressions
> 'pic'/'next'/'last' expressions
> Acting on URLs
> Start of Downloading
> Tricks of Downloading
> End of Downloading
> Extras
> Bug Reports


==Usage==
/path/to/script [COMIC_NAME]
COMIC_NAME must be an entry in the conf file.
If COMIC_NAME is given, the script acts only on that one
comic.
If it is not given, then all configured comics are checked.
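For example (assuming your conf file has an entry named
'xkcd'):
    /path/to/script xkcd    check only the 'xkcd' comic
    /path/to/script         check every configured comic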


==Goals==
The script targets users with sporadic or occasional
bandwidth. Since the program can run for a long time, it
is expected to be resumable: the user must be able to
kill the program and later resume it where it left off.
It also means that as few downloads as possible are made
to fetch the comics.


==Dependencies==
-wget
-imagemagick
-perl (of course)
I considered ditching the wget dependency, but found that
wget is so much better than Perl's built-in downloading.


==Modes==
It has two modes:
- Counting mode, where the site url or image contains an
    incrementing number
- Full Parsing mode, where the site has a "next" or "back"
    button.
Counting mode is the default unless the "next" value is
set in the conf file.
In Counting mode, the files are grabbed in two ways:
- Direct download.
    If the image is stored online in an incrementing
    format, then the direct link to the image can be
    listed by itself in the conf file.
    This means that the "title" text (or mouse-over text)
    cannot be gathered.
- Indirect download.
    If the url of the page where the image is stored is in
    an incrementing format, then a regular expression
    'pic' can be set to catch the image in the page.
    With this option, you can also set "get_text" in order
    to capture the title text (mouse-over text) of the
    image.
Counting Mode is 'cleaner' with its output than Parsing
mode because it has more control over the naming, but it
is quite a bit more complicated. Sometimes comics have
missing files because they don't count perfectly. See the
'skip' option in the conf file.
In Full Parsing mode, the files are grabbed by starting at
the url specified and crawling through the site by finding
a link. This means you need to set two regular
expressions:
- The regular expression for the picture
- The regular expression that allows the parser to find
    the next link.
Note: if you run Full Parsing mode for a specific comic
and don't finish, it will pick up where it left off based
on the "next_url" file in that comic's folder.
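To make Full Parsing mode concrete, here is a rough Perl
sketch of the crawl loop. It is only an illustration: the
start url, regexes, and file handling are stand-ins, not
the script's actual internals.

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $url        = 'http://comic.example.com/page1.html';
    my $pic_regex  = qr{<img src="([^"]+)" class="comic"};  # group 1 = image
    my $next_regex = qr{<a href="([^"]+)"[^>]*>next</a>}i;  # group 1 = next page

    while (defined $url) {
        my $page = `wget -q -O - '$url'`;         # fetch the page with wget
        last unless defined $page && length $page;

        if ($page =~ $pic_regex) {
            # A relative link would need resolving first; see
            # 'Acting on URLs' below.
            system('wget', '-q', $1);             # grab the image itself
        }

        my ($next) = $page =~ $next_regex;
        last if !defined $next || $next eq $url;  # no link, or stuck: stop

        # Persist the resume point, like the real "next_url" file.
        open my $fh, '>', 'next_url' or die "next_url: $!";
        print {$fh} $next;
        close $fh;

        $url = $next;
    }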


==Configuring==
If you use Ubuntu, simply configure two files:
    1) /etc/cmcdwnlr2/download
        Write on the first line the path to the folder
        where the comics will be stored.
        Anything on subsequent lines will be ignored.
    2) /etc/cmcdwnlr2/comics
        This is the configuration for each comic.
        See the 'Conf file' section for more details.

Otherwise, simply open up the 'cmcdwnlr2.pl' file and
edit/set the variables:
    $HOME: the location of the cmcdwnlr files or "apps"
        folder.
    $HEADER: if you're counting, how do you want your
        comics labeled? I went for Comic001.jpg etc. so I
        wrote "Comic"
    $CONF_FILE: the location of your comic configurations
    $DOWN_FILE: the location of the file that says where
        to download to. (or just set "$O_DIR" directly...)
    $BLANK_FILE: the location of your 'blank' image, used
        in the case of a missing image on a page when
        counting. See 'skip'.
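
For example, a plausible set of values (every path here is
just an example):
    $HOME       = "/home/user/cmcdwnlr2";    # the apps folder
    $HEADER     = "Comic";                   # yields Comic001.jpg, ...
    $CONF_FILE  = "/etc/cmcdwnlr2/comics";
    $DOWN_FILE  = "/etc/cmcdwnlr2/download";
    $BLANK_FILE = "/home/user/cmcdwnlr2/blank.png";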

There is also the "wget" command variable in the
"downloader.pl" file:
    $WGET_COMMAND: the command that runs wget. I added:
        --user-agent=Mozilla
        --tries=inf
        so that wget appears to be a browser, and keeps
        retrying until the download succeeds.
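So the assembled command might read something like this
(illustrative, not necessarily the exact default):
    $WGET_COMMAND = "wget --user-agent=Mozilla --tries=inf";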
If anyone has better ideas, ones that allow flexibility
and ease of use, I welcome them. Send me an e-mail.


==Conf file==
The file is set up in the following format:
    COMICNAME:
        mode1=""
        mode2=""
The modes are predesignated, but the names are not. You
can choose any name for your comic. This name is the name
of the folder the comic will be saved in.
There are currently these Modes:
- url: the start url, or the "expression" for the
    images/pages
- pic: the regular expression for the image. When used,
    the 'url' will be assumed to be an "expression" for
    the page the image is located on.
- next: the regular expression for the next page. When
    used, the 'url' will be assumed to be the start page.
    'pic' must also be assigned.
- get_text: if set to true (or almost anything else), then
    the title text (or mouse-over text) of the downloaded
    image will also be gathered and saved to
    $dir/$comic.title.$ext
    This option needs to have 'pic' set in order to
    function.
- skip: When incrementing, occasionally comics are skipped.
    In the case of xkcd, where page 404 had no comic,
    'skip' will look ahead a designated number of pages to
    see if there is a comic. If there is not, then it
    exits. If there is, it copies "$BLANK" as a
    placeholder. The placeholder is used so cmcdwnlr does
    not use bandwidth to try and download it again.
    When Parsing, sometimes a page is left blank for other
    reasons. If 'skip' is set, then it will simply assume
    the parsing did not fail, and go on to the next page.
- last: Sometimes a comic will have other files at the end.
    If there is a place on any page pointing to the last
    number, locate it with the "last" regex. It expects a
    number, and the parser will stop at that number.
- tier: The comics are sorted by this. Tier 1 comics are
    downloaded first, then tier 2, etc.
Note: In both modes, a bad regular expression could trigger
'skip'. It's wise to make sure that your regular expression
is lax enough, and sometimes even to visit the website in a
browser to make sure nothing is missed.
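
To make the format concrete, here is a hypothetical
counting-mode entry. The url and regex are illustrative
and untested, and the exact quoting/escaping rules depend
on the conf parser:
    xkcd:
        url="http://xkcd.com/{0}/"
        pic="<img src=\"(http://imgs.xkcd.com/comics/[^\"]+)\""
        get_text="1"
        skip="3"
        tier="1"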


=='url' expressions==
In the case where you are using the "counting" downloader,
the url will contain a number. To allow cmcdwnlr to insert
the number, use {} to mark its location. The contents of
the brackets change depending on the "padding":
    1-9999  : {0} effectively turning off padding:
        ex:         http://comic-wo-padding.com/{0}.png
        will catch: http://comic-wo-padding.com/1.png
        and:        http://comic-wo-padding.com/99.png
        but not:    http://comic-wo-padding.com/09.png
    001-999 : {3} pad the numbers to three:
        ex:         http://comic-w-padding.com/{3}.png
        will catch: http://comic-w-padding.com/001.png
        and:        http://comic-w-padding.com/999.png
        but not:    http://comic-w-padding.com/0001.png
        or:         http://comic-w-padding.com/1.png
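
In Perl terms, the substitution presumably behaves like
this sketch (hypothetical code, not the script's actual
internals):

    use strict;
    use warnings;

    my $template = 'http://comic-w-padding.com/{3}.png';
    my $n        = 7;

    my ($pad) = $template =~ /\{(\d+)\}/;     # pad width from the {} marker
    my $num   = sprintf("%0${pad}d", $n);     # 7 -> "007" when the width is 3
    (my $url = $template) =~ s/\{\d+\}/$num/;

    print "$url\n";   # http://comic-w-padding.com/007.png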


=='pic'/'next'/'last' expressions==
It's a wise idea to read up on regular expressions. Since
it's easy to mess up an expression, testing is a good
idea: try using 'regex_tester.pl' to see if your
expression will work.
The actual link, or picture, should be enclosed in the
first group of parens. Parentheses can be used to enclose
other portions of text (and can even be nested), but the
first enclosed group will be assumed to be the
link/picture/text to act upon.
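
In Perl, "the first group of parens" simply means $1 after
a successful match; a tiny illustration:

    my $page = '<a href="2.html"><img src="comics/007.png"></a>';
    if ($page =~ m{<img src="([^"]+)"}) {
        print "picture: $1\n";   # comics/007.png -- the first paren group
    }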


==Acting on URLs==
Since many sites embed relative links, cmcdwnlr acts on
the parsed url/picture the way a browser would.
ie:
    If the path is of the form:
        /path/to/file
    then it will be counted as a domain link.
    If the path is of the form:
        path/to/file
    then it will be counted as a relative link from the
    current page.
    If the path is of the form:
        http://comic.com/path/to/file
    then it will be counted as a new link entirely.
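A simplified Perl sketch of how such resolution can work
(a guess at the behavior, not the actual code):

    use strict;
    use warnings;

    sub resolve {
        my ($page_url, $path) = @_;
        return $path if $path =~ m{^https?://};     # a new link entirely
        my ($domain) = $page_url =~ m{^(https?://[^/]+)};
        return "$domain$path" if $path =~ m{^/};    # domain link
        (my $dir = $page_url) =~ s{[^/]*$}{};       # relative: drop page name
        return "$dir$path";
    }

    print resolve('http://comic.com/a/b.html', '/path/to/file'), "\n";
    # -> http://comic.com/path/to/file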
It should be noted that even if a website says that the
file is a gif, cmcdwnlr will use imagemagick to determine
the file's actual type.


==Start of Downloading==
Downloading starts differently depending on the mode:
-Counting Mode: In this case, the downloader looks for the
    first file in the folder and pulls out the number,
    stripping any leading zeros. This number is used to
    continue the downloading. It then checks each number
    to see if the file is downloaded already; if it is
    missing from the folder, it will download it.
-Parsing Mode: If the designated comic has already been
    run, it will have a "next_url" file saved. This file
    contains the url to continue from. From there it gets
    the page, checks the url, sees if the file is already
    in the folder, and if not, downloads it.
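
For Counting Mode, the scan might look something like this
simplified sketch (the folder and file names are
assumptions, and it takes the highest number found):

    use strict;
    use warnings;

    my $dir = '/path/to/comics/somecomic';
    opendir my $dh, $dir or die "$dir: $!";
    my $max = 0;
    for my $f (readdir $dh) {
        # Names like Comic007.png: pull the number, strip leading zeros.
        $max = $1 if $f =~ /^Comic0*(\d+)\./ && $1 > $max;
    }
    closedir $dh;
    print "continue from ", $max + 1, "\n";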


==Tricks of Downloading==
This is the place where I comment on some of the tricks
cmcdwnlr uses when downloading.
1) Wget allows for more flexibility with downloading:
    - infinite re-tries
    - telling the server it's a web browser
2) Sometimes websites embed a jpg file (for example) but
    call it a gif. This means that if the download is
    saved with a gif extension, programs won't be able to
    open it.
    identify (part of the imagemagick package) allows
    identification of the image, so the correct extension
    can be applied at the end.
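
For instance, fixing an extension with identify might look
like this (a sketch; the file name is hypothetical):

    use strict;
    use warnings;

    my $file = 'Comic007.gif';
    chomp(my $type = `identify -format '%m' '$file' 2>/dev/null`);
    if ($type eq 'JPEG') {                 # the site lied: it's really a jpg
        (my $fixed = $file) =~ s/\.\w+$/.jpg/;
        rename $file, $fixed or die "rename: $!";
    }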


==End of Downloading==
Cases where downloading ends:
1) If a header request for 'url' returns a 404: exit. In
    the case it does not, it's assumed that there's a
    picture inside, or that it is the picture.
2) If after parsing with 'pic' or 'next' no path is found:
    exit.
3) If after parsing with 'next' the url parsed is the
    same: exit.
4) If 'last' is set, and the current number is greater
    than the 'last' number, then stop before downloading.
If 'skip' is set, this can allow continued downloading,
but only in the case of no picture being found (case 2).


==Extras==
--cbzerize--
In the folder is a Bash script called *cbzerize*. Its
purpose is to create comic books of your downloaded
comics. It uses the zip format. It has not been tested by
anyone other than me. Use at your own risk.
That being said, the script should not (used loosely)
delete comics. It only moves them to a folder so that
they can later be restored.
You will need a comic book reader in order to open the
files. Of course, evince works for this.
--Regex--
There is a *regex_tester.pl* tool that can be used to help
find the right expression for downloading. Keep in mind
that it's useful to test more than one page before you
decide that the expression is good enough.
If you installed using the "install" script, then you'll
find it as /usr/bin/cmcdwnlr2-regex.
--Bash Completion--
For those of you using bash, there is also a Bash
completion script. If you used the "install" script, it
will install on its own. Otherwise, copy it to
/etc/bash_completion.d/
It automatically reads your conf file, and finds the
appropriate comics to run.


==Bug Reports==
Bug reports without the 'conf' file will be ignored.
Please include any and all cmcdwnlr output from when the
error occurs.

