Compare commits

13 Commits

Author SHA1 Message Date
2288806105 fix #77: use processed ad dir for duplicate checking, not slug 2024-02-10 15:15:43 +01:00
5a2c277f0e fix #71 and #73: add support for outdir template and enhance docs 2024-02-10 14:44:09 +01:00
612ed2aa79 fix #74: warn if about to write to already visited ad, overwrite if -f 2024-02-10 14:44:09 +01:00
ed78731b3c check seek error 2024-01-27 17:34:44 +01:00
a84f0e1436 get rid of duplicate bytes.Buffer, use bytes.Reader instead, #39 2024-01-27 17:34:44 +01:00
d8d5be5c7d fix #58: add missing dashes to self issue template 2024-01-27 17:34:44 +01:00
bcf920c91e correct #39 add --ignoreerrors flag 2024-01-27 17:34:44 +01:00
T.v.Dein
14f8c3fd43 Fix/linter (#66)
* added lint targets
* fix linter errors
* enhance error handling
* !!BREAKING!! rename Id to ID in tpls
2024-01-25 19:04:15 +01:00
9cd1fc0596 behavior changes: UserAgent configurable, test cookies, check errors 2024-01-24 19:22:31 +01:00
8df3ebfa6d add throttling to image download 2024-01-24 19:22:31 +01:00
de82127223 first step in fixing #49:
fetch cookies from 1st response and use them in subsequent requests.
2024-01-24 19:22:31 +01:00
a79a28f4a1 add contribution guidelines and non-code-of-conduct 2024-01-23 18:01:14 +01:00
95b1172b7f fix typo 2024-01-23 17:26:06 +01:00
24 changed files with 934 additions and 234 deletions


@@ -5,3 +5,4 @@ title: "[bug-report]"
 labels: bug
 assignees: TLINDEN
+---

114
CODE_OF_CONDUCT.md Normal file

@@ -0,0 +1,114 @@
# No Code of Conduct
*TL;DR:* This project does **NOT** have a so-called Code of Conduct,
nor will it ever have one.
## The Rant
The reasons are somewhat complicated and I'll try my best to document
them here.
Ethical codes or rules come along like laws. But how is ethical or
moral behavior defined? And who defines which behavior is ethical and
which is not? Certainly not me.
Unless you live in a dictatorship (and more than half of the
population on planet earth do as of this writing), laws come into
existence by democratic procedures. Laws cover almost every aspect of
life in a society. Laws allow and forbid behavior and laws sanction
infringements.
A software project like this one on the other hand is not a society.
There are not enough people involved to form democratic
structures. And there will always be a minority of users who have the
right to commit or reject code. How could any maintainer of a software
project dare to decree rules upon others? Actually, am I, the current
maintainer of this very project authorized to do so?
I think the answer to this question clearly is NO.
The issue is complicated by the fact that open source
development these days happens on a planetary scale. And this planet
houses hundreds if not thousands of different cultures, philosophies,
ideologies and worldviews. The answer to many ethical questions will
in most cases be vague and nebulous.
One's joke will always be another one's insult.
Then there is the problem of language. I myself am not a native English
speaker, but I publish everything using the English language. I am able
to communicate with most people in the open source community because
of that. But I am certainly not able to understand everything and
everyone. There might be nuances to a sentence I don't sense, there
might be sarcastic connotations I don't understand or references to
historical figures, events or traditions I don't know and never have
heard of.
Judging other people's online behavior looks like a titanic task
to me. It is just not my job to judge others, I am not legitimized or
authorized to do so and I am not interested in this kind of business.
Another huge problem with ethical rules is that you need to outline
and enforce sanctions on those who violate the rules. But since I am
not an elected authority how would I be able to do this? I don't
know. And what happens if someone complains about myself? Shall I
remove myself from my own project? Come on!
Last but not least there's the law. So, let's say someone in India
writes something insulting to some other developer in an issue. Of
course German law does not apply to Indian people. Moreover, the
insult might actually not be an insult in India. In the end, nothing
would happen. Under normal circumstances, maintainers would
eventually delete the posting, ban the user or remove push privileges
etc.
But then, is there a way for the offending user to defend himself? Of
course not, since neither Indian nor German law alone applies. I cannot
go to a German court and sue the guy and he cannot do the same in
India. Or - we possibly could but the judges in both countries would
just laugh and close the case.
That being said, I don't have the power nor the tools, nor the
authority to enforce serious sanctions of any meaningful kind against
others. Therefore I cannot outline any rules whatsoever.
And let's not even start talking about these undemocratic "committees"
many projects are forming to circumvent this problem. Some projects
even include external entities like a lawyer or some bureaucrat
somewhere just to have the ability to complain against a committee
member. What a mess!
## So, what are the ethical rules within this project then?
Well, there are none.
This project is about code, not society. It doesn't matter where you
come from, how you look, how you think, what you believe, who your
friends are, what you said or did sometime in the past. I don't even
care if you are a human being. You are an alien so bored that you need
to submit code on GitHub? Fine with me. You're a convicted criminal? I
don't give a shit!
**The only thing I am interested in here is Code and only Code.**
So if anything happens here I don't like or I am obliged by (German!)
law to act on, I will decide on a case-by-case basis what to do. And
unfortunately, since this is the nature of a GitHub project, you
cannot complain, object or protest. I am very sorry!
If you will, let's at least outline these:
- Please - just please - behave towards others as you'd expect others
to behave towards yourself.
- Don't judge others for any reason.
- Only judge the code.
But these are not rules, only a friendly appeal to you as a developer
and user.
Thanks a lot!

93
CONTRIBUTING.md Normal file

@@ -0,0 +1,93 @@
## Project Goals
The goal of this project is to build a small tool which helps in
maintaining backups of the German ad site kleinanzeigen.de. It should
be small, fast and easy to understand.
There will be no GUI, no web interface, no public API of any sort, no
built-in interpreter.
The programming language used for this project will always be
[GOLANG](https://go.dev/) with the exception of the documentation
([Perl POD](https://perldoc.perl.org/perlpod)) and the Makefile.
# Contributing
You can contribute to this project in various ways:
## Open an issue
If you encounter a problem or don't understand how the program works
or if you think the documentation is unclear, please don't hesitate to
open an issue.
Please add as much information about the case as possible, such as:
- Your environment (operating system etc.)
- kleingebaeck version (`kleingebaeck --version`)
- Commandline used. Please replace sensitive information with mock data!
- Repeat the command with debugging enabled (`-d` flag)
- Actual program output. Please replace sensitive information with mock data!
- Expected program output.
- Error message - if any.
Be aware that I am working on this project (and some others) in my
spare time which is scarce. Therefore please don't expect me to
respond to your query within hours or even days. Be patient, but I
WILL respond.
## Pull Requests
Code and documentation help is always much appreciated! Please follow
these guidelines to successfully contribute:
- Every pull request shall be based on the latest `development`
branch. `main` is only used for releases.
- Execute the unit tests before committing: `make test`. There shall
be no errors.
- Strive to be backwards compatible so that users who are already
using the program don't have to change their habits - unless it is
really necessary.
- Try to add a unit test for your fix, addition or modification.
- Don't ever change existing unit tests!
- Add a meaningful and comprehensive rationale about your contribution:
  - Why do you think it might be useful for others?
  - What did you actually change or add?
  - Is there an open issue which this PR fixes? If so, please link
    to that issue.
- [Re-]format your code with `gofmt -s`.
- Avoid unnecessary dependencies, especially for very small functions.
- **If** a new dependency is being added, it must be compatible with
our [license agreement](LICENSE).
- You need to accept that the code or documentation you contribute
will be redistributed under the terms of said license agreement. If
your contribution is considerably large or if you contribute
regularly, then feel free to add your name (and if you want your
email address) to the *AUTHORS* section of the
[manpage](kleingebaeck.pod).
- Adhere to the above mentioned project goals.
- If you are unsure if your addition or change will be accepted,
better ask before you start coding. Open an issue about your proposal
and let's discuss it! That way we avoid doing unnecessary work on
both sides.
Each pull request will be carefully reviewed and if it is a useful
addition it will be accepted. However, please be prepared that
sometimes a PR will be rejected. The reasons may vary and will be
documented. Perhaps the above guidelines are not matched, or the
addition seems to be not so useful from my perspective, maybe there
are too many changes or there might be changes I don't even
understand.
But whatever happens: your contribution is always welcome!


@@ -56,6 +56,14 @@ test: clean
 	mkdir -p t/out
 	go test ./... $(ARGS)
+testlint: test lint
+lint:
+	golangci-lint run
+lint-full:
+	golangci-lint run --enable-all --exclude-use-default --disable exhaustivestruct,exhaustruct,depguard,interfacer,deadcode,golint,structcheck,scopelint,varcheck,ifshort,maligned,nosnakecase,godot,funlen,gofumpt,cyclop,noctx,gochecknoglobals,paralleltest
 testfuzzy: clean
 	go test -fuzz ./... $(ARGS)
@@ -88,5 +96,5 @@ show-versions: buildlocal
 	@echo "### go version used for building:"
 	@grep -m 1 go go.mod
-lint:
-	golangci-lint run -p bugs -p unused
+# lint:
+# golangci-lint run -p bugs -p unused


@@ -8,6 +8,7 @@
![GitHub License](https://img.shields.io/github/license/tlinden/kleingebaeck) ![GitHub License](https://img.shields.io/github/license/tlinden/kleingebaeck)
[![GitHub release](https://img.shields.io/github/v/release/tlinden/kleingebaeck?color=%2300a719)](https://github.com/TLINDEN/kleingebaeck/releases/latest) [![GitHub release](https://img.shields.io/github/v/release/tlinden/kleingebaeck?color=%2300a719)](https://github.com/TLINDEN/kleingebaeck/releases/latest)
[![English](https://github.com/TLINDEN/kleingebaeck/blob/main/.github/assets/english.png)](https://github.com/tlinden/kleingebaeck/blob/main/README.md) [![English](https://github.com/TLINDEN/kleingebaeck/blob/main/.github/assets/english.png)](https://github.com/tlinden/kleingebaeck/blob/main/README.md)
Mit diesem Tool kann man seine Anzeigen bei https://kleinanzeigen.de sichern. Mit diesem Tool kann man seine Anzeigen bei https://kleinanzeigen.de sichern.
Es kann alle Anzeigen eines Users (oder nur eine Ausgewählte) Es kann alle Anzeigen eines Users (oder nur eine Ausgewählte)

8
ad.go

@@ -1,5 +1,5 @@
 /*
-Copyright © 2023 Thomas von Dein
+Copyright © 2023-2024 Thomas von Dein
 This program is free software: you can redistribute it and/or modify
 it under the terms of the GNU General Public License as published by
@@ -30,7 +30,7 @@ type Index struct {
 type Ad struct {
 	Title string `goquery:"h1"`
 	Slug string
-	Id string
+	ID string
 	Condition string `goquery:".addetailslist--detail--value,text"`
 	Category string
 	CategoryTree []string `goquery:".breadcrump-link,text"`
@@ -46,7 +46,7 @@ func (ad *Ad) LogValue() slog.Value {
 	return slog.GroupValue(
 		slog.String("title", ad.Title),
 		slog.String("price", ad.Price),
-		slog.String("id", ad.Id),
+		slog.String("id", ad.ID),
 		slog.Int("imagecount", len(ad.Images)),
 		slog.Int("bodysize", len(ad.Text)),
 		slog.String("categorytree", strings.Join(ad.CategoryTree, "+")),
@@ -76,7 +76,7 @@ func (ad *Ad) CalculateExpire() {
 	if len(ad.Created) > 0 {
 		ts, err := time.Parse("02.01.2006", ad.Created)
 		if err == nil {
-			ad.Expire = ts.AddDate(0, 2, 1).Format("02.01.2006")
+			ad.Expire = ts.AddDate(0, ExpireMonths, ExpireDays).Format("02.01.2006")
 		}
 	}
 }
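
The last ad.go hunk replaces the literal offsets passed to `AddDate` with the constants `ExpireMonths` and `ExpireDays`, which the config.go diff defines as 2 and 1. The sketch below runs the same calculation in isolation. The helper name `expireDate` and the constant `dateLayout` are assumptions made for illustration; in the project this logic lives inside `(*Ad).CalculateExpire`.

```go
package main

import (
	"fmt"
	"time"
)

// Values as defined in the const block of config.go in this diff.
const (
	ExpireMonths = 2
	ExpireDays   = 1
	dateLayout   = "02.01.2006" // day.month.year, the layout used in ad.go
)

// expireDate is a hypothetical helper: it parses a creation date and returns
// the expiry date in the same layout, or "" if the input cannot be parsed.
func expireDate(created string) string {
	ts, err := time.Parse(dateLayout, created)
	if err != nil {
		return ""
	}

	return ts.AddDate(0, ExpireMonths, ExpireDays).Format(dateLayout)
}

func main() {
	fmt.Println(expireDate("10.02.2024")) // prints 11.04.2024
}
```

Naming the offsets does not change behavior. It only gives the "two months and one day" rule a single place where it can be adjusted.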

105
config.go
View File

@@ -17,7 +17,6 @@ along with this program. If not, see <http://www.gnu.org/licenses/>.
package main package main
import ( import (
"errors"
"fmt" "fmt"
"io" "io"
"os" "os"
@@ -35,21 +34,41 @@ import (
) )
const ( const (
VERSION string = "0.3.0" VERSION string = "0.3.4"
Baseuri string = "https://www.kleinanzeigen.de" Baseuri string = "https://www.kleinanzeigen.de"
Listuri string = "/s-bestandsliste.html" Listuri string = "/s-bestandsliste.html"
Defaultdir string = "." Defaultdir string = "."
DefaultTemplate string = "Title: {{.Title}}\nPrice: {{.Price}}\nId: {{.Id}}\n" +
DefaultTemplate string = "Title: {{.Title}}\nPrice: {{.Price}}\nId: {{.ID}}\n" +
"Category: {{.Category}}\nCondition: {{.Condition}}\n" + "Category: {{.Category}}\nCondition: {{.Condition}}\n" +
"Created: {{.Created}}\nExpire: {{.Expire}}\n\n{{.Text}}\n" "Created: {{.Created}}\nExpire: {{.Expire}}\n\n{{.Text}}\n"
DefaultTemplateWin string = "Title: {{.Title}}\r\nPrice: {{.Price}}\r\nId: {{.Id}}\r\n" +
DefaultTemplateWin string = "Title: {{.Title}}\r\nPrice: {{.Price}}\r\nId: {{.ID}}\r\n" +
"Category: {{.Category}}\r\nCondition: {{.Condition}}\r\n" + "Category: {{.Category}}\r\nCondition: {{.Condition}}\r\n" +
"Created: {{.Created}}\r\nExpires: {{.Expire}}\r\n\r\n{{.Text}}\r\n" "Created: {{.Created}}\r\nExpires: {{.Expire}}\r\n\r\n{{.Text}}\r\n"
Useragent string = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) " +
DefaultUserAgent string = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) " +
"AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36" "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
DefaultAdNameTemplate string = "{{.Slug}}" DefaultAdNameTemplate string = "{{.Slug}}"
DefaultOutdirTemplate string = "."
// for image download throttling
MinThrottle int = 2
MaxThrottle int = 20
// we extract the slug from the uri
SlugURIPartNum int = 6
ExpireMonths int = 2
ExpireDays int = 1
WIN string = "windows"
) )
var DirsVisited map[string]int
const Usage string = `This is kleingebaeck, the kleinanzeigen.de backup tool. const Usage string = `This is kleingebaeck, the kleinanzeigen.de backup tool.
Usage: kleingebaeck [-dvVhmoclu] [<ad-listing-url>,...] Usage: kleingebaeck [-dvVhmoclu] [<ad-listing-url>,...]
@@ -62,7 +81,7 @@ Options:
-l --limit <num> Limit the ads to download to <num>, default: load all. -l --limit <num> Limit the ads to download to <num>, default: load all.
-c --config <file> Use config file <file> (default: ~/.kleingebaeck). -c --config <file> Use config file <file> (default: ~/.kleingebaeck).
--ignoreerrors Ignore HTTP errors, may lead to incomplete ad backup. --ignoreerrors Ignore HTTP errors, may lead to incomplete ad backup.
-f --force Download images even if they already exist. -f --force Overwrite images and ads even if the already exist.
-m --manual Show manual. -m --manual Show manual.
-h --help Show usage. -h --help Show usage.
-V --version Show program version. -V --version Show program version.
@@ -84,6 +103,7 @@ type Config struct {
Limit int `koanf:"limit"` Limit int `koanf:"limit"`
IgnoreErrors bool `koanf:"ignoreerrors"` IgnoreErrors bool `koanf:"ignoreerrors"`
ForceDownload bool `koanf:"force"` ForceDownload bool `koanf:"force"`
UserAgent string `koanf:"useragent"` // conf only
Adlinks []string Adlinks []string
StatsCountAds int StatsCountAds int
StatsCountImages int StatsCountImages int
@@ -98,54 +118,58 @@ func (c *Config) IncrImgs(num int) {
} }
// load commandline flags and config file // load commandline flags and config file
func InitConfig(w io.Writer) (*Config, error) { func InitConfig(output io.Writer) (*Config, error) {
var k = koanf.New(".") var kloader = koanf.New(".")
// determine template based on os // determine template based on os
template := DefaultTemplate template := DefaultTemplate
if runtime.GOOS == "windows" { if runtime.GOOS == WIN {
template = DefaultTemplateWin template = DefaultTemplateWin
} }
// Load default values using the confmap provider. // Load default values using the confmap provider.
if err := k.Load(confmap.Provider(map[string]interface{}{ if err := kloader.Load(confmap.Provider(map[string]interface{}{
"template": template, "template": template,
"outdir": ".", "outdir": DefaultOutdirTemplate,
"loglevel": "notice", "loglevel": "notice",
"userid": 0, "userid": 0,
"adnametemplate": DefaultAdNameTemplate, "adnametemplate": DefaultAdNameTemplate,
"useragent": DefaultUserAgent,
}, "."), nil); err != nil { }, "."), nil); err != nil {
return nil, err return nil, fmt.Errorf("failed to load default values into koanf: %w", err)
} }
// setup custom usage // setup custom usage
f := flag.NewFlagSet("config", flag.ContinueOnError) flagset := flag.NewFlagSet("config", flag.ContinueOnError)
f.Usage = func() { flagset.Usage = func() {
fmt.Fprintln(w, Usage) fmt.Fprintln(output, Usage)
os.Exit(0) os.Exit(0)
} }
// parse commandline flags // parse commandline flags
f.StringP("config", "c", "", "config file") flagset.StringP("config", "c", "", "config file")
f.StringP("outdir", "o", "", "directory where to store ads") flagset.StringP("outdir", "o", "", "directory where to store ads")
f.IntP("user", "u", 0, "user id") flagset.IntP("user", "u", 0, "user id")
f.IntP("limit", "l", 0, "limit ads to be downloaded (default 0, unlimited)") flagset.IntP("limit", "l", 0, "limit ads to be downloaded (default 0, unlimited)")
f.BoolP("verbose", "v", false, "be verbose") flagset.BoolP("verbose", "v", false, "be verbose")
f.BoolP("debug", "d", false, "enable debug log") flagset.BoolP("debug", "d", false, "enable debug log")
f.BoolP("version", "V", false, "show program version") flagset.BoolP("version", "V", false, "show program version")
f.BoolP("help", "h", false, "show usage") flagset.BoolP("help", "h", false, "show usage")
f.BoolP("manual", "m", false, "show manual") flagset.BoolP("manual", "m", false, "show manual")
f.BoolP("force", "f", false, "force") flagset.BoolP("force", "f", false, "force")
flagset.BoolP("ignoreerrors", "", false, "ignore image download HTTP errors")
if err := f.Parse(os.Args[1:]); err != nil { if err := flagset.Parse(os.Args[1:]); err != nil {
return nil, err return nil, fmt.Errorf("failed to parse program arguments: %w", err)
} }
// generate a list of config files to try to load, including the // generate a list of config files to try to load, including the
// one provided via -c, if any // one provided via -c, if any
var configfiles []string var configfiles []string
configfile, _ := f.GetString("config")
configfile, _ := flagset.GetString("config")
home, _ := os.UserHomeDir() home, _ := os.UserHomeDir()
if configfile != "" { if configfile != "" {
configfiles = []string{configfile} configfiles = []string{configfile}
} else { } else {
@@ -161,31 +185,30 @@ func InitConfig(w io.Writer) (*Config, error) {
for _, cfgfile := range configfiles { for _, cfgfile := range configfiles {
if path, err := os.Stat(cfgfile); !os.IsNotExist(err) { if path, err := os.Stat(cfgfile); !os.IsNotExist(err) {
if !path.IsDir() { if !path.IsDir() {
if err := k.Load(file.Provider(cfgfile), toml.Parser()); err != nil { if err := kloader.Load(file.Provider(cfgfile), toml.Parser()); err != nil {
return nil, errors.New("error loading config file: " + err.Error()) return nil, fmt.Errorf("error loading config file: %w", err)
} }
} }
} } // else: we ignore the file if it doesn't exists
// else: we ignore the file if it doesn't exists
} }
// env overrides config file // env overrides config file
if err := k.Load(env.Provider("KLEINGEBAECK_", ".", func(s string) string { if err := kloader.Load(env.Provider("KLEINGEBAECK_", ".", func(s string) string {
return strings.Replace(strings.ToLower( return strings.ReplaceAll(strings.ToLower(
strings.TrimPrefix(s, "KLEINGEBAECK_")), "_", ".", -1) strings.TrimPrefix(s, "KLEINGEBAECK_")), "_", ".")
}), nil); err != nil { }), nil); err != nil {
return nil, errors.New("error loading environment: " + err.Error()) return nil, fmt.Errorf("error loading environment: %w", err)
} }
// command line overrides env // command line overrides env
if err := k.Load(posflag.Provider(f, ".", k), nil); err != nil { if err := kloader.Load(posflag.Provider(flagset, ".", kloader), nil); err != nil {
return nil, errors.New("error loading flags: " + err.Error()) return nil, fmt.Errorf("error loading flags: %w", err)
} }
// fetch values // fetch values
conf := &Config{} conf := &Config{}
if err := k.Unmarshal("", &conf); err != nil { if err := kloader.Unmarshal("", &conf); err != nil {
return nil, errors.New("error unmarshalling: " + err.Error()) return nil, fmt.Errorf("error unmarshalling: %w", err)
} }
// adjust loglevel // adjust loglevel
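
The environment loader in the hunk above turns variable names into config keys by stripping the `KLEINGEBAECK_` prefix, lowercasing the rest and replacing `_` with the `.` delimiter. A minimal sketch of just that transformation, without koanf, might look like this. The function name `envToKey` is an assumption; in the diff it is an anonymous callback passed to `env.Provider`.

```go
package main

import (
	"fmt"
	"strings"
)

// envToKey mirrors the callback from config.go: strip the prefix,
// lowercase the remainder and turn "_" into the "." key delimiter.
func envToKey(s string) string {
	return strings.ReplaceAll(strings.ToLower(
		strings.TrimPrefix(s, "KLEINGEBAECK_")), "_", ".")
}

func main() {
	fmt.Println(envToKey("KLEINGEBAECK_OUTDIR"))    // prints outdir
	fmt.Println(envToKey("KLEINGEBAECK_USERAGENT")) // prints useragent
}
```

`strings.ReplaceAll(s, old, new)` is equivalent to the previous `strings.Replace(s, old, new, -1)`, so this part of the diff is a readability change only.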
@@ -197,7 +220,7 @@ func InitConfig(w io.Writer) (*Config, error) {
} }
// are there any args left on commandline? if so threat them as adlinks // are there any args left on commandline? if so threat them as adlinks
conf.Adlinks = f.Args() conf.Adlinks = flagset.Args()
return conf, nil return conf, nil
} }


@@ -19,55 +19,84 @@ package main
import ( import (
"errors" "errors"
"fmt"
"io" "io"
"log/slog" "log/slog"
"net/http" "net/http"
"net/http/cookiejar"
"net/url"
) )
// convenient wrapper to fetch some web content // convenient wrapper to fetch some web content
type Fetcher struct { type Fetcher struct {
Config *Config Config *Config
Client *http.Client Client *http.Client
Useragent string // FIXME: make configurable Cookies []*http.Cookie
} }
func NewFetcher(c *Config) *Fetcher { func NewFetcher(conf *Config) (*Fetcher, error) {
return &Fetcher{ jar, err := cookiejar.New(nil)
Client: &http.Client{Transport: &loggingTransport{}}, // implemented in http.go if err != nil {
Useragent: Useragent, // default in config.go return nil, fmt.Errorf("failed to create a cookie jar obj: %w", err)
Config: c,
} }
return &Fetcher{
Client: &http.Client{
Transport: &loggingTransport{}, // implemented in http.go
Jar: jar,
},
Config: conf,
Cookies: []*http.Cookie{},
},
nil
} }
func (f *Fetcher) Get(uri string) (io.ReadCloser, error) { func (f *Fetcher) Get(uri string) (io.ReadCloser, error) {
req, err := http.NewRequest("GET", uri, nil) req, err := http.NewRequest(http.MethodGet, uri, nil)
if err != nil { if err != nil {
return nil, err return nil, fmt.Errorf("failed to create a new HTTP request obj: %w", err)
} }
req.Header.Set("User-Agent", f.Useragent) req.Header.Set("User-Agent", f.Config.UserAgent)
if len(f.Cookies) > 0 {
uriobj, _ := url.Parse(Baseuri)
slog.Debug("have cookies, sending them",
"sample-cookie-name", f.Cookies[0].Name,
"sample-cookie-expire", f.Cookies[0].Expires,
)
f.Client.Jar.SetCookies(uriobj, f.Cookies)
}
res, err := f.Client.Do(req) res, err := f.Client.Do(req)
if err != nil { if err != nil {
return nil, err return nil, fmt.Errorf("failed to initiate HTTP request to %s: %w", uri, err)
} }
if res.StatusCode != 200 { if res.StatusCode != http.StatusOK {
return nil, errors.New("could not get page via HTTP") return nil, errors.New("could not get page via HTTP")
} }
slog.Debug("got cookies?", "cookies", res.Cookies())
f.Cookies = res.Cookies()
return res.Body, nil return res.Body, nil
} }
// fetch an image // fetch an image
func (f *Fetcher) Getimage(uri string) (io.ReadCloser, error) { func (f *Fetcher) Getimage(uri string) (io.ReadCloser, error) {
slog.Debug("fetching ad image", "uri", uri) slog.Debug("fetching ad image", "uri", uri)
body, err := f.Get(uri) body, err := f.Get(uri)
if err != nil { if err != nil {
if f.Config.IgnoreErrors { if f.Config.IgnoreErrors {
slog.Info("Failed to download image, error ignored", "error", err.Error()) slog.Info("Failed to download image, error ignored", "error", err.Error())
return nil, nil return nil, nil
} }
return nil, err return nil, err
} }

5
go.mod

@@ -14,7 +14,7 @@ require (
 	github.com/lmittmann/tint v1.0.4
 	github.com/mattn/go-isatty v0.0.20
 	github.com/spf13/pflag v1.0.5
-	github.com/tlinden/yadu v0.1.1
+	github.com/tlinden/yadu v0.1.2
 	golang.org/x/sync v0.5.0
 )
@@ -31,8 +31,9 @@ require (
 	github.com/mitchellh/reflectwalk v1.0.2 // indirect
 	github.com/nfnt/resize v0.0.0-20180221191011-83c6a9932646 // indirect
 	github.com/pelletier/go-toml v1.9.5 // indirect
+	github.com/pkg/errors v0.9.1 // indirect
 	golang.org/x/net v0.0.0-20220722155237-a158d28d115b // indirect
-	golang.org/x/sys v0.14.0 // indirect
+	golang.org/x/sys v0.17.0 // indirect
 	gopkg.in/yaml.v3 v3.0.1 // indirect
 )

6
go.sum

@@ -50,6 +50,8 @@ github.com/nfnt/resize v0.0.0-20180221191011-83c6a9932646 h1:zYyBkD/k9seD2A7fsi6
github.com/nfnt/resize v0.0.0-20180221191011-83c6a9932646/go.mod h1:jpp1/29i3P1S/RLdc7JQKbRpFeM1dOBd8T9ki5s+AY8= github.com/nfnt/resize v0.0.0-20180221191011-83c6a9932646/go.mod h1:jpp1/29i3P1S/RLdc7JQKbRpFeM1dOBd8T9ki5s+AY8=
github.com/pelletier/go-toml v1.9.5 h1:4yBQzkHv+7BHq2PQUZF3Mx0IYxG7LsP222s7Agd3ve8= github.com/pelletier/go-toml v1.9.5 h1:4yBQzkHv+7BHq2PQUZF3Mx0IYxG7LsP222s7Agd3ve8=
github.com/pelletier/go-toml v1.9.5/go.mod h1:u1nR/EPcESfeI/szUZKdtJ0xRNbUoANCkoOuaOx1Y+c= github.com/pelletier/go-toml v1.9.5/go.mod h1:u1nR/EPcESfeI/szUZKdtJ0xRNbUoANCkoOuaOx1Y+c=
github.com/pkg/errors v0.9.1 h1:FEBLx1zS214owpjy7qsBeixbURkuhQAwrK5UwLGTwt4=
github.com/pkg/errors v0.9.1/go.mod h1:bwawxfHBFNV+L2hUp1rHADufV3IMtnDRdf1r5NINEl0=
github.com/pmezard/go-difflib v1.0.0 h1:4DBwDE0NGyQoBHbLQYPwSUPoCMWR5BEzIk/f1lZbAQM= github.com/pmezard/go-difflib v1.0.0 h1:4DBwDE0NGyQoBHbLQYPwSUPoCMWR5BEzIk/f1lZbAQM=
github.com/pmezard/go-difflib v1.0.0/go.mod h1:iKH77koFhYxTK1pcRnkKkqfTogsbg7gZNVY4sRDYZ/4= github.com/pmezard/go-difflib v1.0.0/go.mod h1:iKH77koFhYxTK1pcRnkKkqfTogsbg7gZNVY4sRDYZ/4=
github.com/spf13/pflag v1.0.5 h1:iy+VFUOCP1a+8yFto/drg2CJ5u0yRoB7fZw3DKv/JXA= github.com/spf13/pflag v1.0.5 h1:iy+VFUOCP1a+8yFto/drg2CJ5u0yRoB7fZw3DKv/JXA=
@@ -64,6 +66,8 @@ github.com/tlinden/yadu v0.1.0 h1:qtCi1jxg392qVRLFyrJ2LYu6/PiKSp1LT02EX+mNLME=
github.com/tlinden/yadu v0.1.0/go.mod h1:l3bRmHKL9zGAR6pnBHY2HRPxBecf7L74BoBgOOpTcUA= github.com/tlinden/yadu v0.1.0/go.mod h1:l3bRmHKL9zGAR6pnBHY2HRPxBecf7L74BoBgOOpTcUA=
github.com/tlinden/yadu v0.1.1 h1:116oEUy9b4PcMF5wLL2dCFA/sn/praYutOnao07MROw= github.com/tlinden/yadu v0.1.1 h1:116oEUy9b4PcMF5wLL2dCFA/sn/praYutOnao07MROw=
github.com/tlinden/yadu v0.1.1/go.mod h1:l3bRmHKL9zGAR6pnBHY2HRPxBecf7L74BoBgOOpTcUA= github.com/tlinden/yadu v0.1.1/go.mod h1:l3bRmHKL9zGAR6pnBHY2HRPxBecf7L74BoBgOOpTcUA=
github.com/tlinden/yadu v0.1.2 h1:TYYVnUJwziRJ9YPbIbRf9ikmDw0Q8Ifixm+J/kBQFh8=
github.com/tlinden/yadu v0.1.2/go.mod h1:l3bRmHKL9zGAR6pnBHY2HRPxBecf7L74BoBgOOpTcUA=
golang.org/x/crypto v0.0.0-20190308221718-c2843e01d9a2/go.mod h1:djNgcEr1/C05ACkg1iLfiJU5Ep61QUkGW8qpdssI0+w= golang.org/x/crypto v0.0.0-20190308221718-c2843e01d9a2/go.mod h1:djNgcEr1/C05ACkg1iLfiJU5Ep61QUkGW8qpdssI0+w=
golang.org/x/net v0.0.0-20180218175443-cbe0f9307d01/go.mod h1:mL1N/T3taQHkDXs73rZJwtUhF3w3ftmwwsq0BUmARs4= golang.org/x/net v0.0.0-20180218175443-cbe0f9307d01/go.mod h1:mL1N/T3taQHkDXs73rZJwtUhF3w3ftmwwsq0BUmARs4=
golang.org/x/net v0.0.0-20181114220301-adae6a3d119a/go.mod h1:mL1N/T3taQHkDXs73rZJwtUhF3w3ftmwwsq0BUmARs4= golang.org/x/net v0.0.0-20181114220301-adae6a3d119a/go.mod h1:mL1N/T3taQHkDXs73rZJwtUhF3w3ftmwwsq0BUmARs4=
@@ -79,6 +83,8 @@ golang.org/x/sys v0.0.0-20220908164124-27713097b956/go.mod h1:oPkhp1MJrh7nUepCBc
golang.org/x/sys v0.6.0/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg= golang.org/x/sys v0.6.0/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg=
golang.org/x/sys v0.14.0 h1:Vz7Qs629MkJkGyHxUlRHizWJRG2j8fbQKjELVSNhy7Q= golang.org/x/sys v0.14.0 h1:Vz7Qs629MkJkGyHxUlRHizWJRG2j8fbQKjELVSNhy7Q=
golang.org/x/sys v0.14.0/go.mod h1:/VUhepiaJMQUp4+oa/7Zr1D23ma6VTLIYjOOTFZPUcA= golang.org/x/sys v0.14.0/go.mod h1:/VUhepiaJMQUp4+oa/7Zr1D23ma6VTLIYjOOTFZPUcA=
golang.org/x/sys v0.17.0 h1:25cE3gD+tdBA7lp7QfhuV+rJiE9YXTcS3VG1SqssI/Y=
golang.org/x/sys v0.17.0/go.mod h1:/VUhepiaJMQUp4+oa/7Zr1D23ma6VTLIYjOOTFZPUcA=
golang.org/x/text v0.3.0/go.mod h1:NqM8EUOU14njkJ3fqMW+pc6Ldnwhi/IjpwHt7yyuwOQ= golang.org/x/text v0.3.0/go.mod h1:NqM8EUOU14njkJ3fqMW+pc6Ldnwhi/IjpwHt7yyuwOQ=
gopkg.in/check.v1 v0.0.0-20161208181325-20d25e280405 h1:yhCVgyC4o1eVCa2tZl7eS0r+SDo693bJlVdllGtEeKM= gopkg.in/check.v1 v0.0.0-20161208181325-20d25e280405 h1:yhCVgyC4o1eVCa2tZl7eS0r+SDo693bJlVdllGtEeKM=
gopkg.in/check.v1 v0.0.0-20161208181325-20d25e280405/go.mod h1:Co6ibVJAznAaIkqp8huTwlJQCZ016jof/cbN4VW5Yz0= gopkg.in/check.v1 v0.0.0-20161208181325-20d25e280405/go.mod h1:Co6ibVJAznAaIkqp8huTwlJQCZ016jof/cbN4VW5Yz0=

37
http.go

@@ -19,6 +19,7 @@ package main
import ( import (
"bytes" "bytes"
"fmt"
"io" "io"
"log/slog" "log/slog"
"math" "math"
@@ -32,17 +33,20 @@ import (
// easier associated in debug output // easier associated in debug output
var letters = []rune("ABCDEF0123456789") var letters = []rune("ABCDEF0123456789")
func getid() string { const IDLEN int = 8
b := make([]rune, 8)
for i := range b {
b[i] = letters[rand.Intn(len(letters))]
}
return string(b)
}
// retry after HTTP 50x errors or err!=nil // retry after HTTP 50x errors or err!=nil
const RetryCount = 3 const RetryCount = 3
func getid() string {
b := make([]rune, IDLEN)
for i := range b {
b[i] = letters[rand.Intn(len(letters))]
}
return string(b)
}
// used to inject debug log and implement retries // used to inject debug log and implement retries
type loggingTransport struct{} type loggingTransport struct{}
@@ -75,6 +79,7 @@ func drainBody(resp *http.Response) {
// unable to copy data? uff! // unable to copy data? uff!
panic(err) panic(err)
} }
resp.Body.Close() resp.Body.Close()
} }
} }
@@ -82,8 +87,8 @@ func drainBody(resp *http.Response) {
// the actual logging transport with retries // the actual logging transport with retries
func (t *loggingTransport) RoundTrip(req *http.Request) (*http.Response, error) { func (t *loggingTransport) RoundTrip(req *http.Request) (*http.Response, error) {
// just requred for debugging // just required for debugging
id := getid() requestid := getid()
// clone the request body, put into request on retry // clone the request body, put into request on retry
var bodyBytes []byte var bodyBytes []byte
@@ -92,16 +97,16 @@ func (t *loggingTransport) RoundTrip(req *http.Request) (*http.Response, error)
req.Body = io.NopCloser(bytes.NewBuffer(bodyBytes)) req.Body = io.NopCloser(bytes.NewBuffer(bodyBytes))
} }
slog.Debug("REQUEST", "id", id, "uri", req.URL, "host", req.Host) slog.Debug("REQUEST", "id", requestid, "uri", req.URL, "host", req.Host)
// first try // first try
resp, err := http.DefaultTransport.RoundTrip(req) resp, err := http.DefaultTransport.RoundTrip(req)
if err == nil { if err == nil {
slog.Debug("RESPONSE", "id", id, "status", resp.StatusCode, slog.Debug("RESPONSE", "id", requestid, "status", resp.StatusCode,
"contentlength", resp.ContentLength) "contentlength", resp.ContentLength)
} }
// enter retry check and loop, if first req were successfull, leave loop immediately // enter retry check and loop, if first req were successful, leave loop immediately
retries := 0 retries := 0
for shouldRetry(err, resp) && retries < RetryCount { for shouldRetry(err, resp) && retries < RetryCount {
time.Sleep(backoff(retries)) time.Sleep(backoff(retries))
@@ -118,12 +123,16 @@ func (t *loggingTransport) RoundTrip(req *http.Request) (*http.Response, error)
resp, err = http.DefaultTransport.RoundTrip(req) resp, err = http.DefaultTransport.RoundTrip(req)
if err == nil { if err == nil {
slog.Debug("RESPONSE", "id", id, "status", resp.StatusCode, slog.Debug("RESPONSE", "id", requestid, "status", resp.StatusCode,
"contentlength", resp.ContentLength, "retry", retries) "contentlength", resp.ContentLength, "retry", retries)
} }
retries++ retries++
} }
return resp, err if err != nil {
return resp, fmt.Errorf("failed to get HTTP response for %s: %w", req.URL, err)
}
return resp, nil
} }
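
The change above wraps the transport error with `%w` instead of returning it bare. A minimal sketch of why that can matter for callers, assuming a hypothetical sentinel error `errTimeout` and helper names `fetch` and `isTimeout` that are not part of this repository:

```go
package main

import (
	"errors"
	"fmt"
)

// errTimeout is a hypothetical sentinel error standing in for a
// low-level transport failure. It is not defined in the project.
var errTimeout = errors.New("connection timed out")

// fetch mimics the pattern from the diff: the low-level error is
// wrapped with context using the %w verb.
func fetch(uri string) error {
	err := errTimeout // pretend the round trip failed
	return fmt.Errorf("failed to get HTTP response for %s: %w", uri, err)
}

// isTimeout reports whether err wraps errTimeout anywhere in its chain.
func isTimeout(err error) bool {
	return errors.Is(err, errTimeout)
}

func main() {
	err := fetch("https://example.com/ad")
	fmt.Println(err)
	fmt.Println(isTimeout(err))
}
```

Because `%w` keeps the original error in the chain, a caller can still test for the underlying cause with `errors.Is` while the log message carries the request URL.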


@@ -19,6 +19,7 @@ package main
import ( import (
"bytes" "bytes"
"fmt"
"image/jpeg" "image/jpeg"
"log/slog" "log/slog"
"os" "os"
@@ -32,15 +33,15 @@ const MaxDistance = 3
type Image struct { type Image struct {
Filename string Filename string
Hash *goimagehash.ImageHash Hash *goimagehash.ImageHash
Data *bytes.Buffer Data *bytes.Reader
Uri string URI string
} }
// used for logging to avoid printing Data // used for logging to avoid printing Data
func (img *Image) LogValue() slog.Value { func (img *Image) LogValue() slog.Value {
return slog.GroupValue( return slog.GroupValue(
slog.String("filename", img.Filename), slog.String("filename", img.Filename),
slog.String("uri", img.Uri), slog.String("uri", img.URI),
slog.String("hash", img.Hash.ToString()), slog.String("hash", img.Hash.ToString()),
) )
} }
@@ -48,10 +49,10 @@ func (img *Image) LogValue() slog.Value {
// holds all images of an ad // holds all images of an ad
type Cache []*goimagehash.ImageHash type Cache []*goimagehash.ImageHash
func NewImage(buf *bytes.Buffer, filename string, uri string) *Image { func NewImage(buf *bytes.Reader, filename string, uri string) *Image {
img := &Image{ img := &Image{
Filename: filename, Filename: filename,
Uri: uri, URI: uri,
Data: buf, Data: buf,
} }
@@ -62,12 +63,12 @@ func NewImage(buf *bytes.Buffer, filename string, uri string) *Image {
func (img *Image) CalcHash() error { func (img *Image) CalcHash() error {
jpgdata, err := jpeg.Decode(img.Data) jpgdata, err := jpeg.Decode(img.Data)
if err != nil { if err != nil {
return err return fmt.Errorf("failed to decode JPEG image: %w", err)
} }
hash1, err := goimagehash.DifferenceHash(jpgdata) hash1, err := goimagehash.DifferenceHash(jpgdata)
if err != nil { if err != nil {
return err return fmt.Errorf("failed to calculate diff hash of image: %w", err)
} }
img.Hash = hash1 img.Hash = hash1
@@ -80,16 +81,18 @@ func (img *Image) Similar(hash *goimagehash.ImageHash) bool {
distance, err := img.Hash.Distance(hash) distance, err := img.Hash.Distance(hash)
if err != nil { if err != nil {
slog.Debug("failed to compute diff hash distance", "error", err) slog.Debug("failed to compute diff hash distance", "error", err)
return false return false
} }
if distance < MaxDistance { if distance < MaxDistance {
slog.Debug("distance computation", "image-A", img.Hash.ToString(), slog.Debug("distance computation", "image-A", img.Hash.ToString(),
"image-B", hash.ToString(), "distance", distance) "image-B", hash.ToString(), "distance", distance)
return true return true
} else {
return false
} }
return false
} }
// check current image against all known hashes. // check current image against all known hashes.
@@ -108,7 +111,7 @@ func (img *Image) SimilarExists(cache Cache) bool {
func ReadImages(addir string, dont bool) (Cache, error) { func ReadImages(addir string, dont bool) (Cache, error) {
files, err := os.ReadDir(addir) files, err := os.ReadDir(addir)
if err != nil { if err != nil {
return nil, err return nil, fmt.Errorf("failed to read ad directory contents: %w", err)
} }
cache := Cache{} cache := Cache{}
@@ -122,12 +125,15 @@ func ReadImages(addir string, dont bool) (Cache, error) {
ext := filepath.Ext(file.Name()) ext := filepath.Ext(file.Name())
if !file.IsDir() && (ext == ".jpg" || ext == ".jpeg" || ext == ".JPG" || ext == ".JPEG") { if !file.IsDir() && (ext == ".jpg" || ext == ".jpeg" || ext == ".JPG" || ext == ".JPEG") {
filename := filepath.Join(addir, file.Name()) filename := filepath.Join(addir, file.Name())
data, err := ReadImage(filename) data, err := ReadImage(filename)
if err != nil { if err != nil {
return nil, err return nil, err
} }
img := NewImage(data, filename, "") reader := bytes.NewReader(data.Bytes())
img := NewImage(reader, filename, "")
if err = img.CalcHash(); err != nil { if err = img.CalcHash(); err != nil {
return nil, err return nil, err
} }
@@ -137,6 +143,5 @@ func ReadImages(addir string, dont bool) (Cache, error) {
} }
} }
//return nil, errors.New("ende")
return cache, nil return cache, nil
} }
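
The `Image.Data` field moves from `*bytes.Buffer` to `*bytes.Reader` in this diff. One plausible motivation, consistent with the "check seek error" commit, is that a reader can be rewound and read again, whereas a buffer is drained by reading. A minimal sketch of that behavior, using a hypothetical helper `readTwice` that is not part of the project:

```go
package main

import (
	"bytes"
	"fmt"
	"io"
)

// readTwice reads all bytes from a bytes.Reader, rewinds it with Seek,
// and reads the same bytes a second time. The helper name is an
// assumption made for illustration only.
func readTwice(data []byte) (string, string, error) {
	reader := bytes.NewReader(data)

	first, err := io.ReadAll(reader)
	if err != nil {
		return "", "", fmt.Errorf("first read failed: %w", err)
	}

	// Rewind to the start; a bytes.Buffer offers no equivalent call.
	if _, err := reader.Seek(0, io.SeekStart); err != nil {
		return "", "", fmt.Errorf("failed to seek to start: %w", err)
	}

	second, err := io.ReadAll(reader)
	if err != nil {
		return "", "", fmt.Errorf("second read failed: %w", err)
	}

	return string(first), string(second), nil
}

func main() {
	first, second, err := readTwice([]byte("image bytes"))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(first == second)
}
```

With this approach the same in-memory data could first be decoded for hashing and afterwards be written to disk without keeping a second copy.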


@@ -133,7 +133,7 @@
.\" ======================================================================== .\" ========================================================================
.\" .\"
.IX Title "KLEINGEBAECK 1" .IX Title "KLEINGEBAECK 1"
.TH KLEINGEBAECK 1 "2024-01-22" "1" "User Commands" .TH KLEINGEBAECK 1 "2024-02-10" "1" "User Commands"
.\" For nroff, turn off justification. Always turn off hyphenation; it makes .\" For nroff, turn off justification. Always turn off hyphenation; it makes
.\" way too many mistakes in technical documents. .\" way too many mistakes in technical documents.
.if n .ad l .if n .ad l
@@ -152,7 +152,7 @@ kleingebaeck \- kleinanzeigen.de backup tool
\& \-l \-\-limit <num> Limit the ads to download to <num>, default: load all. \& \-l \-\-limit <num> Limit the ads to download to <num>, default: load all.
\& \-c \-\-config <file> Use config file <file> (default: ~/.kleingebaeck). \& \-c \-\-config <file> Use config file <file> (default: ~/.kleingebaeck).
\& \-\-ignoreerrors Ignore HTTP errors, may lead to incomplete ad backup. \& \-\-ignoreerrors Ignore HTTP errors, may lead to incomplete ad backup.
\& \-f \-\-force Download images even if they already exist. \& \-f \-\-force Overwrite images and ads even if they already exist.
\& \-m \-\-manual Show manual. \& \-m \-\-manual Show manual.
\& \-h \-\-help Show usage. \& \-h \-\-help Show usage.
\& \-V \-\-version Show program version. \& \-V \-\-version Show program version.
@@ -174,14 +174,15 @@ well. We use \s-1TOML\s0 as our configuration language. See
.PP .PP
Format is pretty simple: Format is pretty simple:
.PP .PP
.Vb 10 .Vb 11
\& user = 1010101 \& user = 1010101
\& loglevel = verbose \& loglevel = verbose
\& outdir = "test" \& outdir = "test"
\& useragent = "Mozilla/5.0"
\& template = """ \& template = """
\& Title: {{.Title}} \& Title: {{.Title}}
\& Price: {{.Price}} \& Price: {{.Price}}
\& Id: {{.Id}} \& Id: {{.ID}}
\& Category: {{.Category}} \& Category: {{.Category}}
\& Condition: {{.Condition}} \& Condition: {{.Condition}}
\& Created: {{.Created}} \& Created: {{.Created}}
@@ -190,11 +191,11 @@ Format is pretty simple:
\& """ \& """
.Ve .Ve
.PP .PP
Be carefull if you want to change the template. The variable is a Be careful if you want to change the template. The variable is a
multiline string surrounded by three double quotes. You can leave out multiline string surrounded by three double quotes. You can leave out
certain fields and use any formatting you like. Refer to certain fields and use any formatting you like. Refer to
<https://pkg.go.dev/text/template> for details how to write a <https://pkg.go.dev/text/template> for details how to write a
template. template. Also read the \s-1TEMPLATES\s0 section below.
.PP .PP
If you're on windows and want to customize the output directory, put If you're on windows and want to customize the output directory, put
it into single quotes to avoid the backslashes interpreted as escape it into single quotes to avoid the backslashes interpreted as escape
@@ -203,6 +204,94 @@ chars like this:
.Vb 1 .Vb 1
\& outdir = \*(AqC:\eData\eAds\*(Aq \& outdir = \*(AqC:\eData\eAds\*(Aq
.Ve .Ve
.SH "TEMPLATES"
.IX Header "TEMPLATES"
Various parts of the configuration can be modified using templates:
the output directory, the ad directory and the ad listing itself.
.SS "\s-1OUTPUT DIR TEMPLATE\s0"
.IX Subsection "OUTPUT DIR TEMPLATE"
The config variable \f(CW\*(C`outdir\*(C'\fR or the command line parameter \f(CW\*(C`\-o\*(C'\fR takes a
template which may contain:
.ie n .IP """{{.Year}}""" 4
.el .IP "\f(CW{{.Year}}\fR" 4
.IX Item "{{.Year}}"
.PD 0
.ie n .IP """{{.Month}}""" 4
.el .IP "\f(CW{{.Month}}\fR" 4
.IX Item "{{.Month}}"
.ie n .IP """{{.Day}}""" 4
.el .IP "\f(CW{{.Day}}\fR" 4
.IX Item "{{.Day}}"
.PD
.PP
That way you can create a new output directory for every backup
run. For example:
.PP
.Vb 1
\& outdir = "/home/backups/ads\-{{.Year}}\-{{.Month}}\-{{.Day}}"
.Ve
.PP
Or using the command line flag:
.PP
.Vb 1
\& \-o "/home/backups/ads\-{{.Year}}\-{{.Month}}\-{{.Day}}"
.Ve
.PP
The default value is \f(CW\*(C`.\*(C'\fR \- the current directory.
.SS "\s-1AD DIRECTORY TEMPLATE\s0"
.IX Subsection "AD DIRECTORY TEMPLATE"
The ad directory name can be modified using the following ad values:
.IP "{{.Price}}" 4
.IX Item "{{.Price}}"
.PD 0
.IP "{{.ID}}" 4
.IX Item "{{.ID}}"
.IP "{{.Category}}" 4
.IX Item "{{.Category}}"
.IP "{{.Condition}}" 4
.IX Item "{{.Condition}}"
.IP "{{.Created}}" 4
.IX Item "{{.Created}}"
.IP "{{.Slug}}" 4
.IX Item "{{.Slug}}"
.IP "{{.Text}}" 4
.IX Item "{{.Text}}"
.PD
.PP
It can only be configured in the config file. By default only
\&\f(CW\*(C`{{.Slug}}\*(C'\fR is being used, this is the title of the ad in url format.
.SS "\s-1AD TEMPLATE\s0"
.IX Subsection "AD TEMPLATE"
The ad listing itself can be modified as well, using the same
variables as the ad name template above.
.PP
This is the default template:
.PP
.Vb 7
\& Title: {{.Title}}
\& Price: {{.Price}}
\& Id: {{.ID}}
\& Category: {{.Category}}
\& Condition: {{.Condition}}
\& Created: {{.Created}}
\& Expire: {{.Expire}}
\&
\& {{.Text}}
.Ve
.PP
The config parameter to modify is \f(CW\*(C`template\*(C'\fR. See example.conf in the
source repository. Please take care, since this is a multiline
string. This is how it shall look if you modify it:
.PP
.Vb 2
\& template="""
\& Title: {{.Title}}
\&
\& {{.Text}}
\& """
.Ve
.PP
That is, the content between the two \f(CW"""\fR chars is the template.
.SH "SETUP" .SH "SETUP"
.IX Header "SETUP" .IX Header "SETUP"
To setup the tool, you need to lookup your userid on To setup the tool, you need to lookup your userid on


@@ -14,7 +14,7 @@ SYNOPSYS
-l --limit <num> Limit the ads to download to <num>, default: load all. -l --limit <num> Limit the ads to download to <num>, default: load all.
-c --config <file> Use config file <file> (default: ~/.kleingebaeck). -c --config <file> Use config file <file> (default: ~/.kleingebaeck).
--ignoreerrors Ignore HTTP errors, may lead to incomplete ad backup. --ignoreerrors Ignore HTTP errors, may lead to incomplete ad backup.
-f --force Download images even if they already exist. -f --force Overwrite images and ads even if they already exist.
-m --manual Show manual. -m --manual Show manual.
-h --help Show usage. -h --help Show usage.
-V --version Show program version. -V --version Show program version.
@@ -39,10 +39,11 @@ CONFIGURATION
user = 1010101 user = 1010101
loglevel = verbose loglevel = verbose
outdir = "test" outdir = "test"
useragent = "Mozilla/5.0"
template = """ template = """
Title: {{.Title}} Title: {{.Title}}
Price: {{.Price}} Price: {{.Price}}
Id: {{.Id}} Id: {{.ID}}
Category: {{.Category}} Category: {{.Category}}
Condition: {{.Condition}} Condition: {{.Condition}}
Created: {{.Created}} Created: {{.Created}}
@@ -50,10 +51,11 @@ CONFIGURATION
{{.Text}} {{.Text}}
""" """
Be carefull if you want to change the template. The variable is a Be careful if you want to change the template. The variable is a
multiline string surrounded by three double quotes. You can leave out multiline string surrounded by three double quotes. You can leave out
certain fields and use any formatting you like. Refer to certain fields and use any formatting you like. Refer to
<https://pkg.go.dev/text/template> for details how to write a template. <https://pkg.go.dev/text/template> for details how to write a template.
Also read the TEMPLATES section below.
If you're on windows and want to customize the output directory, put it If you're on windows and want to customize the output directory, put it
into single quotes to avoid the backslashes interpreted as escape chars into single quotes to avoid the backslashes interpreted as escape chars
@@ -61,6 +63,71 @@ CONFIGURATION
outdir = 'C:\Data\Ads' outdir = 'C:\Data\Ads'
TEMPLATES
Various parts of the configuration can be modified using templates: the
output directory, the ad directory and the ad listing itself.
OUTPUT DIR TEMPLATE
The config variable "outdir" or the command line parameter "-o" takes a
template which may contain:
"{{.Year}}"
"{{.Month}}"
"{{.Day}}"
That way you can create a new output directory for every backup run. For
example:
outdir = "/home/backups/ads-{{.Year}}-{{.Month}}-{{.Day}}"
Or using the command line flag:
-o "/home/backups/ads-{{.Year}}-{{.Month}}-{{.Day}}"
The default value is "." - the current directory.
AD DIRECTORY TEMPLATE
The ad directory name can be modified using the following ad values:
{{.Price}}
{{.ID}}
{{.Category}}
{{.Condition}}
{{.Created}}
{{.Slug}}
{{.Text}}
It can only be configured in the config file. By default only
"{{.Slug}}" is being used, this is the title of the ad in url format.
AD TEMPLATE
The ad listing itself can be modified as well, using the same variables
as the ad name template above.
This is the default template:
Title: {{.Title}}
Price: {{.Price}}
Id: {{.ID}}
Category: {{.Category}}
Condition: {{.Condition}}
Created: {{.Created}}
Expire: {{.Expire}}
{{.Text}}
The config parameter to modify is "template". See example.conf in the
source repository. Please take care, since this is a multiline string.
This is how it shall look if you modify it:
template="""
Title: {{.Title}}
{{.Text}}
"""
That is, the content between the two """ chars is the template.
SETUP SETUP
To setup the tool, you need to lookup your userid on kleinanzeigen.de. To setup the tool, you need to lookup your userid on kleinanzeigen.de.
Go to your ad overview page while NOT being logged in: Go to your ad overview page while NOT being logged in:


@@ -13,7 +13,7 @@ kleingebaeck - kleinanzeigen.de backup tool
-l --limit <num> Limit the ads to download to <num>, default: load all. -l --limit <num> Limit the ads to download to <num>, default: load all.
-c --config <file> Use config file <file> (default: ~/.kleingebaeck). -c --config <file> Use config file <file> (default: ~/.kleingebaeck).
--ignoreerrors Ignore HTTP errors, may lead to incomplete ad backup. --ignoreerrors Ignore HTTP errors, may lead to incomplete ad backup.
-f --force Download images even if they already exist. -f --force Overwrite images and ads even if they already exist.
-m --manual Show manual. -m --manual Show manual.
-h --help Show usage. -h --help Show usage.
-V --version Show program version. -V --version Show program version.
@@ -39,10 +39,11 @@ Format is pretty simple:
user = 1010101 user = 1010101
loglevel = verbose loglevel = verbose
outdir = "test" outdir = "test"
useragent = "Mozilla/5.0"
template = """ template = """
Title: {{.Title}} Title: {{.Title}}
Price: {{.Price}} Price: {{.Price}}
Id: {{.Id}} Id: {{.ID}}
Category: {{.Category}} Category: {{.Category}}
Condition: {{.Condition}} Condition: {{.Condition}}
Created: {{.Created}} Created: {{.Created}}
@@ -50,11 +51,11 @@ Format is pretty simple:
{{.Text}} {{.Text}}
""" """
Be carefull if you want to change the template. The variable is a Be careful if you want to change the template. The variable is a
multiline string surrounded by three double quotes. You can leave out multiline string surrounded by three double quotes. You can leave out
certain fields and use any formatting you like. Refer to certain fields and use any formatting you like. Refer to
L<https://pkg.go.dev/text/template> for details how to write a L<https://pkg.go.dev/text/template> for details how to write a
template. template. Also read the TEMPLATES section below.
If you're on windows and want to customize the output directory, put If you're on windows and want to customize the output directory, put
it into single quotes to avoid the backslashes interpreted as escape it into single quotes to avoid the backslashes interpreted as escape
@@ -62,6 +63,91 @@ chars like this:
outdir = 'C:\Data\Ads' outdir = 'C:\Data\Ads'
=head1 TEMPLATES
Various parts of the configuration can be modified using templates:
the output directory, the ad directory and the ad listing itself.
=head2 OUTPUT DIR TEMPLATE
The config variable C<outdir> or the command line parameter C<-o> takes a
template which may contain:
=over
=item C<{{.Year}}>
=item C<{{.Month}}>
=item C<{{.Day}}>
=back
That way you can create a new output directory for every backup
run. For example:
outdir = "/home/backups/ads-{{.Year}}-{{.Month}}-{{.Day}}"
Or using the command line flag:
-o "/home/backups/ads-{{.Year}}-{{.Month}}-{{.Day}}"
The default value is C<.> - the current directory.
=head2 AD DIRECTORY TEMPLATE
The ad directory name can be modified using the following ad values:
=over
=item {{.Price}}
=item {{.ID}}
=item {{.Category}}
=item {{.Condition}}
=item {{.Created}}
=item {{.Slug}}
=item {{.Text}}
=back
It can only be configured in the config file. By default only
C<{{.Slug}}> is being used, this is the title of the ad in url format.
=head2 AD TEMPLATE
The ad listing itself can be modified as well, using the same
variables as the ad name template above.
This is the default template:
Title: {{.Title}}
Price: {{.Price}}
Id: {{.ID}}
Category: {{.Category}}
Condition: {{.Condition}}
Created: {{.Created}}
Expire: {{.Expire}}
{{.Text}}
The config parameter to modify is C<template>. See example.conf in the
source repository. Please take care, since this is a multiline
string. This is how it shall look if you modify it:
template="""
Title: {{.Title}}
{{.Text}}
"""
That is, the content between the two C<"""> chars is the template.
=head1 SETUP =head1 SETUP
To setup the tool, you need to lookup your userid on To setup the tool, you need to lookup your userid on
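
The OUTPUT DIR TEMPLATE section added above describes an `outdir` value containing `{{.Year}}`, `{{.Month}}` and `{{.Day}}`. A minimal sketch of how such a template could be expanded with Go's `text/template`. The struct `OutdirData`, the function `renderOutdir` and the zero-padded string values are assumptions for illustration, not the tool's actual implementation:

```go
package main

import (
	"bytes"
	"fmt"
	"text/template"
)

// OutdirData holds the values available to the outdir template.
// The field names match the manual; the string type and zero padding
// are assumptions.
type OutdirData struct {
	Year  string
	Month string
	Day   string
}

// renderOutdir parses tpl and fills in the date fields.
// This is a hypothetical helper, not a function from the repository.
func renderOutdir(tpl string, data OutdirData) (string, error) {
	tmpl, err := template.New("outdir").Parse(tpl)
	if err != nil {
		return "", fmt.Errorf("failed to parse outdir template: %w", err)
	}

	var out bytes.Buffer
	if err := tmpl.Execute(&out, data); err != nil {
		return "", fmt.Errorf("failed to execute outdir template: %w", err)
	}

	return out.String(), nil
}

func main() {
	dir, err := renderOutdir(
		"/home/backups/ads-{{.Year}}-{{.Month}}-{{.Day}}",
		OutdirData{Year: "2024", Month: "02", Day: "10"},
	)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(dir)
}
```

A template without any placeholders, such as the default `.`, passes through unchanged, so plain directory names keep working.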

57
main.go

@@ -35,38 +35,43 @@ func main() {
os.Exit(Main(os.Stdout)) os.Exit(Main(os.Stdout))
} }
func Main(w io.Writer) int { func Main(output io.Writer) int {
logLevel := &slog.LevelVar{} logLevel := &slog.LevelVar{}
opts := &tint.Options{ opts := &tint.Options{
Level: logLevel, Level: logLevel,
AddSource: false, AddSource: false,
ReplaceAttr: func(groups []string, a slog.Attr) slog.Attr { ReplaceAttr: func(groups []string, attr slog.Attr) slog.Attr {
// Remove time from the output // Remove time from the output
if a.Key == slog.TimeKey { if attr.Key == slog.TimeKey {
return slog.Attr{} return slog.Attr{}
} }
return a
return attr
}, },
NoColor: IsNoTty(), NoColor: IsNoTty(),
} }
logLevel.Set(LevelNotice) logLevel.Set(LevelNotice)
handler := tint.NewHandler(w, opts)
handler := tint.NewHandler(output, opts)
logger := slog.New(handler) logger := slog.New(handler)
slog.SetDefault(logger) slog.SetDefault(logger)
conf, err := InitConfig(w) conf, err := InitConfig(output)
if err != nil { if err != nil {
return Die(err) return Die(err)
} }
if conf.Showversion { if conf.Showversion {
fmt.Fprintf(w, "This is kleingebaeck version %s\n", VERSION) fmt.Fprintf(output, "This is kleingebaeck version %s\n", VERSION)
return 0 return 0
} }
if conf.Showhelp { if conf.Showhelp {
fmt.Fprintln(w, Usage) fmt.Fprintln(output, Usage)
return 0 return 0
} }
@@ -75,6 +80,7 @@ func Main(w io.Writer) int {
if err != nil { if err != nil {
return Die(err) return Die(err)
} }
return 0 return 0
} }
@@ -92,7 +98,8 @@ func Main(w io.Writer) int {
} }
logLevel.Set(slog.LevelDebug) logLevel.Set(slog.LevelDebug)
handler := yadu.NewHandler(w, opts)
handler := yadu.NewHandler(output, opts)
debuglogger := slog.New(handler).With( debuglogger := slog.New(handler).With(
slog.Group("program_info", slog.Group("program_info",
slog.Int("pid", os.Getpid()), slog.Int("pid", os.Getpid()),
@@ -105,15 +112,28 @@ func Main(w io.Writer) int {
slog.Debug("config", "conf", conf) slog.Debug("config", "conf", conf)
// prepare output dir // prepare output dir
err = Mkdir(conf.Outdir) outdir, err := OutDirName(conf)
if err != nil { if err != nil {
return Die(err) return Die(err)
} }
// used for all HTTP requests err = Mkdir(outdir)
fetch := NewFetcher(conf) if err != nil {
return Die(err)
}
conf.Outdir = outdir
if len(conf.Adlinks) >= 1 { // used for all HTTP requests
fetch, err := NewFetcher(conf)
if err != nil {
return Die(err)
}
// setup ad dir registry, needed to check for duplicates
DirsVisited = make(map[string]int)
switch {
case len(conf.Adlinks) >= 1:
// directly backup ad listing[s] // directly backup ad listing[s]
for _, uri := range conf.Adlinks { for _, uri := range conf.Adlinks {
err := ScrapeAd(fetch, uri) err := ScrapeAd(fetch, uri)
@@ -121,25 +141,27 @@ func Main(w io.Writer) int {
return Die(err) return Die(err)
} }
} }
} else if conf.User > 0 { case conf.User > 0:
// backup all ads of the given user (via config or cmdline) // backup all ads of the given user (via config or cmdline)
err := ScrapeUser(fetch) err := ScrapeUser(fetch)
if err != nil { if err != nil {
return Die(err) return Die(err)
} }
} else { default:
return Die(errors.New("invalid or no user id or no ad link specified")) return Die(errors.New("invalid or no user id or no ad link specified"))
} }
if conf.StatsCountAds > 0 { if conf.StatsCountAds > 0 {
adstr := "ads" adstr := "ads"
if conf.StatsCountAds == 1 { if conf.StatsCountAds == 1 {
adstr = "ad" adstr = "ad"
} }
fmt.Fprintf(w, "Successfully downloaded %d %s with %d images to %s.\n",
fmt.Fprintf(output, "Successfully downloaded %d %s with %d images to %s.\n",
conf.StatsCountAds, adstr, conf.StatsCountImages, conf.Outdir) conf.StatsCountAds, adstr, conf.StatsCountImages, conf.Outdir)
} else { } else {
fmt.Fprintf(w, "No ads found.") fmt.Fprintf(output, "No ads found.")
} }
return 0 return 0
@@ -147,5 +169,6 @@ func Main(w io.Writer) int {
func Die(err error) int { func Die(err error) int {
slog.Error("Failure", "error", err.Error()) slog.Error("Failure", "error", err.Error())
return 1 return 1
} }


@@ -21,6 +21,7 @@ import (
"bytes" "bytes"
"errors" "errors"
"fmt" "fmt"
"net/http"
"os" "os"
"strings" "strings"
"testing" "testing"
@@ -42,7 +43,7 @@ const LISTTPL string = `<!DOCTYPE html>
{{ range . }} {{ range . }}
<h2 class="text-module-begin"> <h2 class="text-module-begin">
<a class="ellipsis" <a class="ellipsis"
href="/s-anzeige/{{ .Slug }}/{{ .Id }}">{{ .Title }}</a> href="/s-anzeige/{{ .Slug }}/{{ .ID }}">{{ .Title }}</a>
</h2> </h2>
{{ end }} {{ end }}
</body> </body>
@@ -246,7 +247,7 @@ var invalidtests = []Tests{
type AdConfig struct { type AdConfig struct {
Title string Title string
Slug string Slug string
Id string ID string
Price string Price string
Category string Category string
Condition string Condition string
@@ -258,7 +259,7 @@ type AdConfig struct {
var adsrc = []AdConfig{ var adsrc = []AdConfig{
{ {
Title: "First Ad", Title: "First Ad",
Id: "1", Price: "5€", ID: "1", Price: "5€",
Category: "Klimbim", Category: "Klimbim",
Text: "Thing to sale", Text: "Thing to sale",
Slug: "first-ad", Slug: "first-ad",
@@ -268,7 +269,7 @@ var adsrc = []AdConfig{
}, },
{ {
Title: "Secnd Ad", Title: "Secnd Ad",
Id: "2", Price: "5€", ID: "2", Price: "5€",
Category: "Kram", Category: "Kram",
Text: "Thing to sale", Text: "Thing to sale",
Slug: "second-ad", Slug: "second-ad",
@@ -278,7 +279,7 @@ var adsrc = []AdConfig{
}, },
{ {
Title: "Third Ad", Title: "Third Ad",
Id: "3", ID: "3",
Price: "5€", Price: "5€",
Category: "Kuddelmuddel", Category: "Kuddelmuddel",
Text: "Thing to sale", Text: "Thing to sale",
@@ -289,7 +290,7 @@ var adsrc = []AdConfig{
}, },
{ {
Title: "Forth Ad", Title: "Forth Ad",
Id: "4", ID: "4",
Price: "5€", Price: "5€",
Category: "Krempel", Category: "Krempel",
Text: "Thing to sale", Text: "Thing to sale",
@@ -300,7 +301,7 @@ var adsrc = []AdConfig{
}, },
{ {
Title: "Fifth Ad", Title: "Fifth Ad",
Id: "5", ID: "5",
Price: "5€", Price: "5€",
Category: "Kladderadatsch", Category: "Kladderadatsch",
Text: "Thing to sale", Text: "Thing to sale",
@@ -311,7 +312,7 @@ var adsrc = []AdConfig{
}, },
{ {
Title: "Sixth Ad", Title: "Sixth Ad",
Id: "6", ID: "6",
Price: "5€", Price: "5€",
Category: "Klunker", Category: "Klunker",
Text: "Thing to sale", Text: "Thing to sale",
@@ -333,17 +334,17 @@ type Adsource struct {
} }
// Render a HTML template for an adlisting or an ad // Render a HTML template for an adlisting or an ad
func GetTemplate(l []AdConfig, a AdConfig, htmltemplate string) string { func GetTemplate(adconfigs []AdConfig, adconfig AdConfig, htmltemplate string) string {
tmpl, err := tpl.New("template").Parse(htmltemplate) tmpl, err := tpl.New("template").Parse(htmltemplate)
if err != nil { if err != nil {
panic(err) panic(err)
} }
var out bytes.Buffer var out bytes.Buffer
if len(a.Id) == 0 { if len(adconfig.ID) == 0 {
err = tmpl.Execute(&out, l) err = tmpl.Execute(&out, adconfigs)
} else { } else {
err = tmpl.Execute(&out, a) err = tmpl.Execute(&out, adconfig)
} }
if err != nil { if err != nil {
@@ -390,10 +391,9 @@ func InitValidSources() []Adsource {
// prepare urls for the ads // prepare urls for the ads
for _, ad := range adsrc { for _, ad := range adsrc {
ads = append(ads, Adsource{ ads = append(ads, Adsource{
uri: fmt.Sprintf("%s/s-anzeige/%s/%s", Baseuri, ad.Slug, ad.Id), uri: fmt.Sprintf("%s/s-anzeige/%s/%s", Baseuri, ad.Slug, ad.ID),
content: GetTemplate(nil, ad, ADTPL), content: GetTemplate(nil, ad, ADTPL),
}) })
//panic(GetTemplate(nil, ad, ADTPL))
} }
return ads return ads
@@ -446,43 +446,48 @@ func GetImage(path string) []byte {
// setup httpmock // setup httpmock
func SetIntercept(ads []Adsource) { func SetIntercept(ads []Adsource) {
for _, ad := range ads { headers := http.Header{}
if ad.status == 0 { headers.Add("Set-Cookie", "session=permanent")
ad.status = 200
for _, advertisement := range ads {
if advertisement.status == 0 {
advertisement.status = 200
} }
httpmock.RegisterResponder("GET", ad.uri, httpmock.RegisterResponder("GET", advertisement.uri,
httpmock.NewStringResponder(ad.status, ad.content)) httpmock.NewStringResponder(advertisement.status, advertisement.content).HeaderAdd(headers))
} }
// we just use 2 images, put this here // we just use 2 images, put this here
for _, image := range []string{"t/1.jpg", "t/2.jpg"} { for _, image := range []string{"t/1.jpg", "t/2.jpg"} {
httpmock.RegisterResponder("GET", image, httpmock.RegisterResponder("GET", image,
httpmock.NewBytesResponder(200, GetImage(image))) httpmock.NewBytesResponder(200, GetImage(image)).HeaderAdd(headers))
} }
} }
func VerifyAd(ad AdConfig) error { func VerifyAd(advertisement AdConfig) error {
body := ad.Title + ad.Price + ad.Id + "Kleinanzeigen => " + body := advertisement.Title + advertisement.Price + advertisement.ID + "Kleinanzeigen => " +
ad.Category + ad.Condition + ad.Created advertisement.Category + advertisement.Condition + advertisement.Created
// prepare ad dir name using DefaultAdNameTemplate // prepare ad dir name using DefaultAdNameTemplate
c := Config{Adnametemplate: "{{ .Slug }}"} c := Config{Adnametemplate: "{{ .Slug }}"}
adstruct := Ad{Slug: ad.Slug, Id: ad.Id} adstruct := Ad{Slug: advertisement.Slug, ID: advertisement.ID}
addir, err := AdDirName(&c, &adstruct) addir, err := AdDirName(&c, &adstruct)
if err != nil { if err != nil {
return err return err
} }
file := fmt.Sprintf("t/out/%s/Adlisting.txt", addir) file := fmt.Sprintf("t/out/%s/Adlisting.txt", addir)
content, err := os.ReadFile(file) content, err := os.ReadFile(file)
if err != nil { if err != nil {
return err return fmt.Errorf("unable to read adlisting file: %w", err)
} }
if body != strings.TrimSpace(string(content)) { if body != strings.TrimSpace(string(content)) {
msg := fmt.Sprintf("ad content doesn't match.\nExpect: %s\n Got: %s\n", body, content) msg := fmt.Sprintf("ad content doesn't match.\nExpect: %s\n Got: %s\n", body, content)
return errors.New(msg) return errors.New(msg)
} }
@@ -500,20 +505,21 @@ func TestMain(t *testing.T) {
SetIntercept(InitValidSources()) SetIntercept(InitValidSources())
// run commandline tests // run commandline tests
for _, tt := range tests { for _, test := range tests {
var buf bytes.Buffer var buf bytes.Buffer
os.Args = strings.Split(tt.args, " ")
os.Args = strings.Split(test.args, " ")
ret := Main(&buf) ret := Main(&buf)
if ret != tt.exitcode { if ret != test.exitcode {
t.Errorf("%s with cmd <%s> did not exit with %d but %d", t.Errorf("%s with cmd <%s> did not exit with %d but %d",
tt.name, tt.args, tt.exitcode, ret) test.name, test.args, test.exitcode, ret)
} }
if !strings.Contains(buf.String(), tt.expect) { if !strings.Contains(buf.String(), test.expect) {
t.Errorf("%s with cmd <%s> output did not match.\nExpect: %s\n Got: %s\n", t.Errorf("%s with cmd <%s> output did not match.\nExpect: %s\n Got: %s\n",
tt.name, tt.args, tt.expect, buf.String()) test.name, test.args, test.expect, buf.String())
} }
} }
@@ -536,20 +542,21 @@ func TestMainInvalids(t *testing.T) {
SetIntercept(InitInvalidSources()) SetIntercept(InitInvalidSources())
// run commandline tests // run commandline tests
for _, tt := range invalidtests { for _, test := range invalidtests {
var buf bytes.Buffer var buf bytes.Buffer
os.Args = strings.Split(tt.args, " ")
os.Args = strings.Split(test.args, " ")
ret := Main(&buf) ret := Main(&buf)
if ret != tt.exitcode { if ret != test.exitcode {
t.Errorf("%s with cmd <%s> did not exit with %d but %d", t.Errorf("%s with cmd <%s> did not exit with %d but %d",
tt.name, tt.args, tt.exitcode, ret) test.name, test.args, test.exitcode, ret)
} }
if !strings.Contains(buf.String(), tt.expect) { if !strings.Contains(buf.String(), test.expect) {
t.Errorf("%s with cmd <%s> output did not match.\nExpect: %s\n Got: %s\n", t.Errorf("%s with cmd <%s> output did not match.\nExpect: %s\n Got: %s\n",
tt.name, tt.args, tt.expect, buf.String()) test.name, test.args, test.expect, buf.String())
} }
} }
} }

108
scrape.go

@@ -19,11 +19,12 @@ package main
import ( import (
"bytes" "bytes"
"errors"
"fmt" "fmt"
"log/slog" "log/slog"
"path/filepath" "path/filepath"
"strconv"
"strings" "strings"
"time"
"astuart.co/goq" "astuart.co/goq"
"golang.org/x/sync/errgroup" "golang.org/x/sync/errgroup"
@@ -42,7 +43,9 @@ func ScrapeUser(fetch *Fetcher) error {
for { for {
var index Index var index Index
slog.Debug("fetching page", "uri", uri) slog.Debug("fetching page", "uri", uri)
body, err := fetch.Get(uri) body, err := fetch.Get(uri)
if err != nil { if err != nil {
return err return err
@@ -51,7 +54,7 @@ func ScrapeUser(fetch *Fetcher) error {
err = goq.NewDecoder(body).Decode(&index) err = goq.NewDecoder(body).Decode(&index)
if err != nil { if err != nil {
return err return fmt.Errorf("failed to goquery decode HTML index body: %w", err)
} }
if len(index.Links) == 0 { if len(index.Links) == 0 {
@@ -66,16 +69,16 @@ func ScrapeUser(fetch *Fetcher) error {
} }
page++ page++
uri = baseuri + "&pageNum=" + fmt.Sprintf("%d", page) uri = baseuri + "&pageNum=" + strconv.Itoa(page)
} }
for i, adlink := range adlinks { for index, adlink := range adlinks {
err := ScrapeAd(fetch, Baseuri+adlink) err := ScrapeAd(fetch, Baseuri+adlink)
if err != nil { if err != nil {
return err return err
} }
if fetch.Config.Limit > 0 && i == fetch.Config.Limit-1 { if fetch.Config.Limit > 0 && index == fetch.Config.Limit-1 {
break break
} }
} }
@@ -85,18 +88,20 @@ func ScrapeUser(fetch *Fetcher) error {
// scrape an ad. uri is the full uri of the ad, dir is the basedir // scrape an ad. uri is the full uri of the ad, dir is the basedir
func ScrapeAd(fetch *Fetcher, uri string) error { func ScrapeAd(fetch *Fetcher, uri string) error {
ad := &Ad{} advertisement := &Ad{}
// extract slug and id from uri // extract slug and id from uri
uriparts := strings.Split(uri, "/") uriparts := strings.Split(uri, "/")
if len(uriparts) < 6 { if len(uriparts) < SlugURIPartNum {
return errors.New("invalid uri: " + uri) return fmt.Errorf("invalid uri: %s", uri)
} }
ad.Slug = uriparts[4]
ad.Id = uriparts[5] advertisement.Slug = uriparts[4]
advertisement.ID = uriparts[5]
// get the ad // get the ad
slog.Debug("fetching ad page", "uri", uri) slog.Debug("fetching ad page", "uri", uri)
body, err := fetch.Get(uri) body, err := fetch.Get(uri)
if err != nil { if err != nil {
return err return err
@@ -104,36 +109,53 @@ func ScrapeAd(fetch *Fetcher, uri string) error {
defer body.Close() defer body.Close()
// extract ad contents with goquery/goq // extract ad contents with goquery/goq
err = goq.NewDecoder(body).Decode(&ad) err = goq.NewDecoder(body).Decode(&advertisement)
if err != nil {
return fmt.Errorf("failed to goquery decode HTML ad body: %w", err)
}
if len(advertisement.CategoryTree) > 0 {
advertisement.Category = strings.Join(advertisement.CategoryTree, " => ")
}
if advertisement.Incomplete() {
slog.Debug("got ad", "ad", advertisement)
return fmt.Errorf("could not extract ad data from page, got empty struct")
}
advertisement.CalculateExpire()
// prepare ad dir name
addir, err := AdDirName(fetch.Config, advertisement)
if err != nil { if err != nil {
return err return err
} }
if len(ad.CategoryTree) > 0 { proceed := CheckAdVisited(fetch.Config, addir)
ad.Category = strings.Join(ad.CategoryTree, " => ") if !proceed {
return nil
} }
if ad.Incomplete() {
slog.Debug("got ad", "ad", ad)
return errors.New("could not extract ad data from page, got empty struct")
}
ad.CalculateExpire()
// write listing // write listing
addir, err := WriteAd(fetch.Config, ad) err = WriteAd(fetch.Config, advertisement, addir)
if err != nil { if err != nil {
return err return err
} }
slog.Debug("extracted ad listing", "ad", ad) // tell the user
slog.Debug("extracted ad listing", "ad", advertisement)
// stats
fetch.Config.IncrAds() fetch.Config.IncrAds()
return ScrapeImages(fetch, ad, addir) // register for later checks
DirsVisited[addir] = 1
return ScrapeImages(fetch, advertisement, addir)
} }
func ScrapeImages(fetch *Fetcher, ad *Ad, addir string) error { func ScrapeImages(fetch *Fetcher, advertisement *Ad, addir string) error {
// fetch images // fetch images
img := 1 img := 1
adpath := filepath.Join(fetch.Config.Outdir, addir) adpath := filepath.Join(fetch.Config.Outdir, addir)
@@ -144,26 +166,33 @@ func ScrapeImages(fetch *Fetcher, ad *Ad, addir string) error {
return err return err
} }
g := new(errgroup.Group) egroup := new(errgroup.Group)
for _, imguri := range ad.Images { for _, imguri := range advertisement.Images {
imguri := imguri imguri := imguri
file := filepath.Join(adpath, fmt.Sprintf("%d.jpg", img)) file := filepath.Join(adpath, fmt.Sprintf("%d.jpg", img))
g.Go(func() error {
egroup.Go(func() error {
// wait a little
throttle := GetThrottleTime()
time.Sleep(throttle)
body, err := fetch.Getimage(imguri) body, err := fetch.Getimage(imguri)
if err != nil { if err != nil {
return err return err
} }
buf := new(bytes.Buffer) buf := new(bytes.Buffer)
_, err = buf.ReadFrom(body) _, err = buf.ReadFrom(body)
if err != nil { if err != nil {
return err return fmt.Errorf("failed to read from image buffer: %w", err)
} }
buf2 := buf.Bytes() // needed for image writing reader := bytes.NewReader(buf.Bytes())
image := NewImage(buf, "", imguri) image := NewImage(reader, file, imguri)
err = image.CalcHash() err = image.CalcHash()
if err != nil { if err != nil {
return err return err
@@ -171,27 +200,34 @@ func ScrapeImages(fetch *Fetcher, ad *Ad, addir string) error {
if !fetch.Config.ForceDownload { if !fetch.Config.ForceDownload {
if image.SimilarExists(cache) { if image.SimilarExists(cache) {
slog.Debug("similar image exists, not written", "uri", image.Uri) slog.Debug("similar image exists, not written", "uri", image.URI)
return nil return nil
} }
} }
err = WriteImage(file, buf2) _, err = reader.Seek(0, 0)
if err != nil {
return fmt.Errorf("failed to seek(0) on image reader: %w", err)
}
err = WriteImage(file, reader)
if err != nil { if err != nil {
return err return err
} }
slog.Debug("wrote image", "image", image, "size", len(buf2)) slog.Debug("wrote image", "image", image, "size", buf.Len(), "throttle", throttle)
return nil return nil
}) })
img++ img++
} }
if err := g.Wait(); err != nil { if err := egroup.Wait(); err != nil {
return err return fmt.Errorf("failed to finalize error waitgroup: %w", err)
} }
fetch.Config.IncrImgs(len(ad.Images)) fetch.Config.IncrImgs(len(advertisement.Images))
return nil return nil
} }

108
store.go

@@ -26,77 +26,102 @@ import (
"runtime" "runtime"
"strings" "strings"
tpl "text/template" tpl "text/template"
"time"
) )
func AdDirName(c *Config, ad *Ad) (string, error) { type OutdirData struct {
tmpl, err := tpl.New("adname").Parse(c.Adnametemplate) Year, Day, Month string
}
func OutDirName(conf *Config) (string, error) {
tmpl, err := tpl.New("outdir").Parse(conf.Outdir)
if err != nil { if err != nil {
return "", err return "", fmt.Errorf("failed to parse outdir template: %w", err)
} }
buf := bytes.Buffer{} buf := bytes.Buffer{}
err = tmpl.Execute(&buf, ad)
now := time.Now()
data := OutdirData{
Year: now.Format("2006"),
Month: now.Format("01"),
Day: now.Format("02"),
}
err = tmpl.Execute(&buf, data)
if err != nil { if err != nil {
return "", err return "", fmt.Errorf("failed to execute outdir template: %w", err)
} }
return buf.String(), nil return buf.String(), nil
} }
func WriteAd(c *Config, ad *Ad) (string, error) { func AdDirName(conf *Config, advertisement *Ad) (string, error) {
// prepare ad dir name tmpl, err := tpl.New("adname").Parse(conf.Adnametemplate)
addir, err := AdDirName(c, ad)
if err != nil { if err != nil {
return "", err return "", fmt.Errorf("failed to parse adname template: %w", err)
} }
// prepare output dir buf := bytes.Buffer{}
dir := filepath.Join(c.Outdir, addir)
err = Mkdir(dir) err = tmpl.Execute(&buf, advertisement)
if err != nil { if err != nil {
return "", err return "", fmt.Errorf("failed to execute adname template: %w", err)
}
return buf.String(), nil
}
func WriteAd(conf *Config, advertisement *Ad, addir string) error {
// prepare output dir
dir := filepath.Join(conf.Outdir, addir)
err := Mkdir(dir)
if err != nil {
return err
} }
// write ad file // write ad file
listingfile := filepath.Join(dir, "Adlisting.txt") listingfile := filepath.Join(dir, "Adlisting.txt")
f, err := os.Create(listingfile)
if err != nil {
return "", err
}
defer f.Close()
if runtime.GOOS == "windows" { listingfd, err := os.Create(listingfile)
ad.Text = strings.ReplaceAll(ad.Text, "<br/>", "\r\n") if err != nil {
return fmt.Errorf("failed to create Adlisting.txt: %w", err)
}
defer listingfd.Close()
if runtime.GOOS == WIN {
advertisement.Text = strings.ReplaceAll(advertisement.Text, "<br/>", "\r\n")
} else { } else {
ad.Text = strings.ReplaceAll(ad.Text, "<br/>", "\n") advertisement.Text = strings.ReplaceAll(advertisement.Text, "<br/>", "\n")
} }
tmpl, err := tpl.New("adlisting").Parse(c.Template) tmpl, err := tpl.New("adlisting").Parse(conf.Template)
if err != nil { if err != nil {
return "", err return fmt.Errorf("failed to parse adlisting template: %w", err)
} }
err = tmpl.Execute(f, ad) err = tmpl.Execute(listingfd, advertisement)
if err != nil { if err != nil {
return "", err return fmt.Errorf("failed to execute adlisting template: %w", err)
} }
slog.Info("wrote ad listing", "listingfile", listingfile) slog.Info("wrote ad listing", "listingfile", listingfile)
return addir, nil return nil
} }
func WriteImage(filename string, buf []byte) error { func WriteImage(filename string, reader *bytes.Reader) error {
file, err := os.Create(filename) file, err := os.Create(filename)
if err != nil { if err != nil {
return err return fmt.Errorf("failed to open image file: %w", err)
} }
defer file.Close() defer file.Close()
_, err = file.Write(buf) _, err = reader.WriteTo(file)
if err != nil { if err != nil {
return err return fmt.Errorf("failed to write to image file: %w", err)
} }
return nil return nil
@@ -111,12 +136,12 @@ func ReadImage(filename string) (*bytes.Buffer, error) {
data, err := os.ReadFile(filename) data, err := os.ReadFile(filename)
if err != nil { if err != nil {
return nil, err return nil, fmt.Errorf("failed to read image file: %w", err)
} }
_, err = buf.Write(data) _, err = buf.Write(data)
if err != nil { if err != nil {
return nil, err return nil, fmt.Errorf("failed to write image into buffer: %w", err)
} }
return &buf, nil return &buf, nil
@@ -127,5 +152,24 @@ func fileExists(filename string) bool {
if os.IsNotExist(err) { if os.IsNotExist(err) {
return false return false
} }
return !info.IsDir() return !info.IsDir()
} }
// check if an addir has already been processed by current run and
// decide what to do
func CheckAdVisited(conf *Config, adname string) bool {
if Exists(DirsVisited, adname) {
if conf.ForceDownload {
slog.Warn("an ad with the same name has already been downloaded, overwriting", "addir", adname)
return true
}
// don't overwrite
slog.Warn("an ad with the same name has already been downloaded, skipping (use -f to overwrite)", "addir", adname)
return false
}
// not visited yet, proceed
return true
}

39
store_test.go Normal file

@@ -0,0 +1,39 @@
/*
Copyright © 2023-2024 Thomas von Dein
This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with this program. If not, see <http://www.gnu.org/licenses/>.
*/
package main
import (
"bytes"
"testing"
)
// this is a weird thing. WriteImage() is being called in scrape.go
// which is being tested by TestMain() in main_test.go. However, it
// doesn't show up in the coverage report for unknown reasons, so
// here's a single test for it
func TestWriteImage(t *testing.T) {
t.Parallel()
reader := bytes.NewReader([]byte{1, 2, 3, 4, 5, 6, 7, 8})
file := "t/out/t.jpg"
err := WriteImage(file, reader)
if err != nil {
t.Errorf("Could not write mock image to %s: %s", file, err.Error())
}
}


@@ -1,6 +1,6 @@
# empty config for Main() unit tests to force unit tests NOT to use an # empty config for Main() unit tests to force unit tests NOT to use an
# eventually existing ~/.kleingebaeck! # eventually existing ~/.kleingebaeck!
template=""" template="""
{{.Title}}{{.Price}}{{.Id}}{{.Category}}{{.Condition}}{{.Created}} {{.Title}}{{.Price}}{{.ID}}{{.Category}}{{.Condition}}{{.Created}}
""" """


@@ -2,5 +2,5 @@ user = 1
loglevel = "verbose" loglevel = "verbose"
outdir = "t/out" outdir = "t/out"
template=""" template="""
{{.Title}}{{.Price}}{{.Id}}{{.Category}}{{.Condition}}{{.Created}} {{.Title}}{{.Price}}{{.ID}}{{.Category}}{{.Condition}}{{.Created}}
""" """


@@ -1,5 +1,7 @@
#!/bin/sh -x #!/bin/sh -x
base="../kleinanzeigen" base="../kleinanzeigen"
rm -rf $base
mkdir -p $base mkdir -p $base
echo "Generating /s-bestandsliste.html" echo "Generating /s-bestandsliste.html"

27
util.go

@@ -1,5 +1,5 @@
/* /*
Copyright © 2023 Thomas von Dein Copyright © 2023-2024 Thomas von Dein
This program is free software: you can redistribute it and/or modify This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by it under the terms of the GNU General Public License as published by
@@ -20,9 +20,12 @@ package main
import ( import (
"bytes" "bytes"
"errors" "errors"
"fmt"
"math/rand"
"os" "os"
"os/exec" "os/exec"
"runtime" "runtime"
"time"
"github.com/mattn/go-isatty" "github.com/mattn/go-isatty"
) )
@@ -31,7 +34,7 @@ func Mkdir(dir string) error {
if _, err := os.Stat(dir); errors.Is(err, os.ErrNotExist) { if _, err := os.Stat(dir); errors.Is(err, os.ErrNotExist) {
err := os.Mkdir(dir, os.ModePerm) err := os.Mkdir(dir, os.ModePerm)
if err != nil { if err != nil {
return err return fmt.Errorf("failed to create directory %s: %w", dir, err)
} }
} }
@@ -42,7 +45,8 @@ func man() error {
man := exec.Command("less", "-") man := exec.Command("less", "-")
var b bytes.Buffer var b bytes.Buffer
b.Write([]byte(manpage))
b.WriteString(manpage)
man.Stdout = os.Stdout man.Stdout = os.Stdout
man.Stdin = &b man.Stdin = &b
@@ -51,7 +55,7 @@ func man() error {
err := man.Run() err := man.Run()
if err != nil { if err != nil {
return err return fmt.Errorf("failed to execute 'less': %w", err)
} }
return nil return nil
@@ -59,10 +63,23 @@ func man() error {
// returns TRUE if stdout is NOT a tty or windows // returns TRUE if stdout is NOT a tty or windows
func IsNoTty() bool { func IsNoTty() bool {
if runtime.GOOS == "windows" || !isatty.IsTerminal(os.Stdout.Fd()) { if runtime.GOOS == WIN || !isatty.IsTerminal(os.Stdout.Fd()) {
return true return true
} }
// it is a tty // it is a tty
return false return false
} }
func GetThrottleTime() time.Duration {
return time.Duration(rand.Intn(MaxThrottle-MinThrottle+1)+MinThrottle) * time.Millisecond
}
// look if a key in a map exists, generic variant
func Exists[K comparable, V any](m map[K]V, v K) bool {
if _, ok := m[v]; ok {
return true
}
return false
}