Compare commits

3 Commits

Author SHA1 Message Date
957f1a0c76 behavior changes: UserAgent configurable, test cookies, check errors 2024-01-24 19:19:31 +01:00
0def93d3cd add throttling to image download 2024-01-24 18:35:06 +01:00
9dde55a0f6 first step in fixing #49: fetch cookies from 1st response and use them in subsequent requests. 2024-01-23 18:32:58 +01:00
24 changed files with 228 additions and 708 deletions

View File

@@ -5,4 +5,3 @@ title: "[bug-report]"
labels: bug
assignees: TLINDEN
---

View File

@@ -56,14 +56,6 @@ test: clean
mkdir -p t/out
go test ./... $(ARGS)
testlint: test lint
lint:
golangci-lint run
lint-full:
golangci-lint run --enable-all --exclude-use-default --disable exhaustivestruct,exhaustruct,depguard,interfacer,deadcode,golint,structcheck,scopelint,varcheck,ifshort,maligned,nosnakecase,godot,funlen,gofumpt,cyclop,noctx,gochecknoglobals,paralleltest
testfuzzy: clean
go test -fuzz ./... $(ARGS)
@@ -96,5 +88,5 @@ show-versions: buildlocal
@echo "### go version used for building:"
@grep -m 1 go go.mod
# lint:
# golangci-lint run -p bugs -p unused
lint:
golangci-lint run -p bugs -p unused

View File

@@ -222,49 +222,6 @@ Sowie alle Bilder.
The format can be changed with the `template` variable in the
configuration. The `example.conf` contains an example of the default template.
## Tool Behavior
There are a few things about the behavior of kleingebäck that you
should know:
- all HTML pages and images are always downloaded
- a (configurable) user agent is used
- HTTP cookies are respected
- on errors, the request is retried three times with varying
delays
- image downloads run in parallel with slightly varying
time intervals
- images that look the same are not overwritten
The last point needs a more detailed explanation:
When you post an ad with images on Kleinanzeigen.de, the site
reduces their size (by compressing and downscaling the images
and so on). These reduced images are then downloaded by
kleingebäck. If you kept your original images, you can copy
them into the backup directory afterwards. On a subsequent
kleingebäck run, these images will not be overwritten.
For this we use an algorithm called [distance
hashing](https://github.com/corona10/goimagehash). This algorithm
checks the similarity of images. They may have been altered in
resolution, compression, color depth and much more and are
still recognized as the "same image" (note: not "the identical
one", the files do differ!). Up to a distance of 5 we do not
overwrite images, because we then assume that the local copy
is the original.
Please note, however, that this is NOT a caching mechanism: all
images are still downloaded every time. This is necessary because
we cannot look at the file names, since kleinanzeigen.de renames
them to numbers. The file names can also change if the user
rearranges the images in the ad.
You can disable this behavior with the **--force** option. You
can also use the **--ignoreerrors** option to ignore all errors
that might occur during image download.
## Documentation
The documentation can be

View File

@@ -207,48 +207,6 @@ variable. The supplied sample config contains the default template.
All images will be stored in the same directory.
## Tool Behavior
There are a few things you might want to know about the behavior
of the kleingebäck tool:
- all HTML pages and images are always downloaded
- we use a (customizable) user agent
- we respect HTTP cookies
- in the case of an error, the tool retries 3 times and waits longer
between tries with each retry
- image downloads are parallelized using small time differences to look
more natural
- identical-looking images are not overwritten on subsequent downloads
The last point needs some elaboration:
If you publish an ad on kleinanzeigen.de and post images, those images
will be reduced in size by the site (by compressing and downsizing
them). These reduced images will be downloaded by kleingebäck. However,
you may still have the original images and may want to put them into
the backup directory so that you have everything for one ad together.
You can easily do that, because kleingebäck won't overwrite those
original images. It uses something called a distance hash using
[goimagehash](https://github.com/corona10/goimagehash). This
algorithm checks the similarity of images. If an image has been
resized, it is still very similar to the original one. We accept a
maximum distance of 5; anything above leads to an overwrite.
This works with resized, cropped and otherwise manipulated images as
long as the image still shows the original contents well enough.
Also note that this is NOT a caching mechanism: the images will be
downloaded anyway during each run. We also can't look at the file
names because kleinanzeigen.de renames all images to numbers. And
those might even change if the user rearranges the images.
You can override this behavior using the **--force** option. Another
option, **--ignoreerrors**, can be used to ignore all kinds of image
errors.
## Documentation
You can read the documentation [online](https://github.com/TLINDEN/kleingebaeck/blob/main/kleingebaeck.pod) or locally once you have installed kleingebaeck with: `kleingebaeck --manual`.
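
Reviewer note: the "distance" in the Tool Behavior text above can be read as a Hamming distance between two image hashes. The following is a minimal sketch of that comparison only, assuming 64-bit hash values; it does not compute a hash from pixel data and it is not the goimagehash API. The names `distance`, `similar` and the `maxDistance` parameter are hypothetical.

```go
package main

import (
	"fmt"
	"math/bits"
)

// distance returns the number of bits in which two 64-bit hashes differ
// (the Hamming distance).
func distance(a, b uint64) int {
	return bits.OnesCount64(a ^ b)
}

// similar reports whether two hashes are at most maxDistance bits apart.
// maxDistance is a hypothetical parameter; the README text mentions 5.
func similar(a, b uint64, maxDistance int) bool {
	return distance(a, b) <= maxDistance
}

func main() {
	original := uint64(0xF0F0F0F0F0F0F0F0)
	resized := original ^ 0x7 // flip the three lowest bits
	other := ^original        // every bit differs

	fmt.Println(distance(original, resized), similar(original, resized, 5)) // 3 true
	fmt.Println(distance(original, other), similar(original, other, 5))     // 64 false
}
```

Note that the README text speaks of a maximum distance of 5, while the image.go hunk in this compare shows `MaxDistance = 3` combined with a strict `distance < MaxDistance` check, so the documented and the coded thresholds may not match.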

8
ad.go
View File

@@ -1,5 +1,5 @@
/*
Copyright © 2023-2024 Thomas von Dein
Copyright © 2023 Thomas von Dein
This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
@@ -30,7 +30,7 @@ type Index struct {
type Ad struct {
Title string `goquery:"h1"`
Slug string
ID string
Id string
Condition string `goquery:".addetailslist--detail--value,text"`
Category string
CategoryTree []string `goquery:".breadcrump-link,text"`
@@ -46,7 +46,7 @@ func (ad *Ad) LogValue() slog.Value {
return slog.GroupValue(
slog.String("title", ad.Title),
slog.String("price", ad.Price),
slog.String("id", ad.ID),
slog.String("id", ad.Id),
slog.Int("imagecount", len(ad.Images)),
slog.Int("bodysize", len(ad.Text)),
slog.String("categorytree", strings.Join(ad.CategoryTree, "+")),
@@ -76,7 +76,7 @@ func (ad *Ad) CalculateExpire() {
if len(ad.Created) > 0 {
ts, err := time.Parse("02.01.2006", ad.Created)
if err == nil {
ad.Expire = ts.AddDate(0, ExpireMonths, ExpireDays).Format("02.01.2006")
ad.Expire = ts.AddDate(0, 2, 1).Format("02.01.2006")
}
}
}
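
Reviewer note: the `CalculateExpire` hunk above differs only in whether the offsets are named constants (`ExpireMonths`, `ExpireDays`) or the literals 2 and 1. A minimal sketch of the same date arithmetic, assuming a standalone helper called `expireDate` (hypothetical, not part of the repository):

```go
package main

import (
	"fmt"
	"time"
)

// dateLayout is the layout used in the diff above: day.month.year.
const dateLayout = "02.01.2006"

// expireDate is a hypothetical helper that mirrors the arithmetic in
// CalculateExpire: parse the creation date, then add 2 months and 1 day.
// It returns an empty string if the input does not match the layout.
func expireDate(created string) string {
	ts, err := time.Parse(dateLayout, created)
	if err != nil {
		return ""
	}
	return ts.AddDate(0, 2, 1).Format(dateLayout)
}

func main() {
	fmt.Println(expireDate("24.01.2024")) // prints 25.03.2024
}
```

`AddDate` normalizes its result, so a day that overflows the target month rolls over into the following month.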

View File

@@ -17,6 +17,7 @@ along with this program. If not, see <http://www.gnu.org/licenses/>.
package main
import (
"errors"
"fmt"
"io"
"os"
@@ -34,16 +35,16 @@ import (
)
const (
VERSION string = "0.3.5"
VERSION string = "0.3.1"
Baseuri string = "https://www.kleinanzeigen.de"
Listuri string = "/s-bestandsliste.html"
Defaultdir string = "."
DefaultTemplate string = "Title: {{.Title}}\nPrice: {{.Price}}\nId: {{.ID}}\n" +
DefaultTemplate string = "Title: {{.Title}}\nPrice: {{.Price}}\nId: {{.Id}}\n" +
"Category: {{.Category}}\nCondition: {{.Condition}}\n" +
"Created: {{.Created}}\nExpire: {{.Expire}}\n\n{{.Text}}\n"
DefaultTemplateWin string = "Title: {{.Title}}\r\nPrice: {{.Price}}\r\nId: {{.ID}}\r\n" +
DefaultTemplateWin string = "Title: {{.Title}}\r\nPrice: {{.Price}}\r\nId: {{.Id}}\r\n" +
"Category: {{.Category}}\r\nCondition: {{.Condition}}\r\n" +
"Created: {{.Created}}\r\nExpires: {{.Expire}}\r\n\r\n{{.Text}}\r\n"
@@ -52,23 +53,11 @@ const (
DefaultAdNameTemplate string = "{{.Slug}}"
DefaultOutdirTemplate string = "."
// for image download throttling
MinThrottle int = 2
MaxThrottle int = 20
// we extract the slug from the uri
SlugURIPartNum int = 6
ExpireMonths int = 2
ExpireDays int = 1
WIN string = "windows"
)
var DirsVisited map[string]int
const Usage string = `This is kleingebaeck, the kleinanzeigen.de backup tool.
Usage: kleingebaeck [-dvVhmoclu] [<ad-listing-url>,...]
@@ -81,7 +70,7 @@ Options:
-l --limit <num> Limit the ads to download to <num>, default: load all.
-c --config <file> Use config file <file> (default: ~/.kleingebaeck).
--ignoreerrors Ignore HTTP errors, may lead to incomplete ad backup.
-f --force Overwrite images and ads even if the already exist.
-f --force Download images even if they already exist.
-m --manual Show manual.
-h --help Show usage.
-V --version Show program version.
@@ -118,58 +107,55 @@ func (c *Config) IncrImgs(num int) {
}
// load commandline flags and config file
func InitConfig(output io.Writer) (*Config, error) {
var kloader = koanf.New(".")
func InitConfig(w io.Writer) (*Config, error) {
var k = koanf.New(".")
// determine template based on os
template := DefaultTemplate
if runtime.GOOS == WIN {
if runtime.GOOS == "windows" {
template = DefaultTemplateWin
}
// Load default values using the confmap provider.
if err := kloader.Load(confmap.Provider(map[string]interface{}{
if err := k.Load(confmap.Provider(map[string]interface{}{
"template": template,
"outdir": DefaultOutdirTemplate,
"outdir": ".",
"loglevel": "notice",
"userid": 0,
"adnametemplate": DefaultAdNameTemplate,
"useragent": DefaultUserAgent,
}, "."), nil); err != nil {
return nil, fmt.Errorf("failed to load default values into koanf: %w", err)
return nil, err
}
// setup custom usage
flagset := flag.NewFlagSet("config", flag.ContinueOnError)
flagset.Usage = func() {
fmt.Fprintln(output, Usage)
f := flag.NewFlagSet("config", flag.ContinueOnError)
f.Usage = func() {
fmt.Fprintln(w, Usage)
os.Exit(0)
}
// parse commandline flags
flagset.StringP("config", "c", "", "config file")
flagset.StringP("outdir", "o", "", "directory where to store ads")
flagset.IntP("user", "u", 0, "user id")
flagset.IntP("limit", "l", 0, "limit ads to be downloaded (default 0, unlimited)")
flagset.BoolP("verbose", "v", false, "be verbose")
flagset.BoolP("debug", "d", false, "enable debug log")
flagset.BoolP("version", "V", false, "show program version")
flagset.BoolP("help", "h", false, "show usage")
flagset.BoolP("manual", "m", false, "show manual")
flagset.BoolP("force", "f", false, "force")
flagset.BoolP("ignoreerrors", "", false, "ignore image download HTTP errors")
f.StringP("config", "c", "", "config file")
f.StringP("outdir", "o", "", "directory where to store ads")
f.IntP("user", "u", 0, "user id")
f.IntP("limit", "l", 0, "limit ads to be downloaded (default 0, unlimited)")
f.BoolP("verbose", "v", false, "be verbose")
f.BoolP("debug", "d", false, "enable debug log")
f.BoolP("version", "V", false, "show program version")
f.BoolP("help", "h", false, "show usage")
f.BoolP("manual", "m", false, "show manual")
f.BoolP("force", "f", false, "force")
if err := flagset.Parse(os.Args[1:]); err != nil {
return nil, fmt.Errorf("failed to parse program arguments: %w", err)
if err := f.Parse(os.Args[1:]); err != nil {
return nil, err
}
// generate a list of config files to try to load, including the
// one provided via -c, if any
var configfiles []string
configfile, _ := flagset.GetString("config")
configfile, _ := f.GetString("config")
home, _ := os.UserHomeDir()
if configfile != "" {
configfiles = []string{configfile}
} else {
@@ -185,30 +171,31 @@ func InitConfig(output io.Writer) (*Config, error) {
for _, cfgfile := range configfiles {
if path, err := os.Stat(cfgfile); !os.IsNotExist(err) {
if !path.IsDir() {
if err := kloader.Load(file.Provider(cfgfile), toml.Parser()); err != nil {
return nil, fmt.Errorf("error loading config file: %w", err)
if err := k.Load(file.Provider(cfgfile), toml.Parser()); err != nil {
return nil, errors.New("error loading config file: " + err.Error())
}
}
} // else: we ignore the file if it doesn't exists
}
// else: we ignore the file if it doesn't exists
}
// env overrides config file
if err := kloader.Load(env.Provider("KLEINGEBAECK_", ".", func(s string) string {
return strings.ReplaceAll(strings.ToLower(
strings.TrimPrefix(s, "KLEINGEBAECK_")), "_", ".")
if err := k.Load(env.Provider("KLEINGEBAECK_", ".", func(s string) string {
return strings.Replace(strings.ToLower(
strings.TrimPrefix(s, "KLEINGEBAECK_")), "_", ".", -1)
}), nil); err != nil {
return nil, fmt.Errorf("error loading environment: %w", err)
return nil, errors.New("error loading environment: " + err.Error())
}
// command line overrides env
if err := kloader.Load(posflag.Provider(flagset, ".", kloader), nil); err != nil {
return nil, fmt.Errorf("error loading flags: %w", err)
if err := k.Load(posflag.Provider(f, ".", k), nil); err != nil {
return nil, errors.New("error loading flags: " + err.Error())
}
// fetch values
conf := &Config{}
if err := kloader.Unmarshal("", &conf); err != nil {
return nil, fmt.Errorf("error unmarshalling: %w", err)
if err := k.Unmarshal("", &conf); err != nil {
return nil, errors.New("error unmarshalling: " + err.Error())
}
// adjust loglevel
@@ -220,7 +207,7 @@ func InitConfig(output io.Writer) (*Config, error) {
}
// are there any args left on commandline? if so treat them as adlinks
conf.Adlinks = flagset.Args()
conf.Adlinks = f.Args()
return conf, nil
}
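
Reviewer note: `InitConfig` loads its sources in a fixed order (defaults, config file, environment, command line flags), and each later source overrides the earlier ones. The sketch below illustrates that precedence and the environment key mapping with plain maps instead of koanf. `envToKey` copies the callback body from the diff; `resolve` and the sample values are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

const envPrefix = "KLEINGEBAECK_"

// envToKey mirrors the callback passed to env.Provider in the diff above:
// strip the prefix, lowercase the rest, and turn underscores into dots.
func envToKey(name string) string {
	return strings.ReplaceAll(
		strings.ToLower(strings.TrimPrefix(name, envPrefix)), "_", ".")
}

// resolve is a hypothetical stand-in for the layered loading: layers are
// applied in order, so later layers override earlier ones.
func resolve(layers ...map[string]string) map[string]string {
	merged := map[string]string{}
	for _, layer := range layers {
		for key, value := range layer {
			merged[key] = value
		}
	}
	return merged
}

func main() {
	defaults := map[string]string{"outdir": ".", "loglevel": "notice"}
	configFile := map[string]string{"outdir": "/tmp/ads"}
	environment := map[string]string{envToKey("KLEINGEBAECK_LOGLEVEL"): "debug"}
	flags := map[string]string{"outdir": "/home/backups"}

	conf := resolve(defaults, configFile, environment, flags)
	fmt.Println(conf["outdir"], conf["loglevel"]) // /home/backups debug
}
```

`strings.ReplaceAll(s, old, new)` and `strings.Replace(s, old, new, -1)` produce the same result, so the two variants in the hunk behave identically.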

View File

@@ -19,7 +19,6 @@ package main
import (
"errors"
"fmt"
"io"
"log/slog"
"net/http"
@@ -34,10 +33,10 @@ type Fetcher struct {
Cookies []*http.Cookie
}
func NewFetcher(conf *Config) (*Fetcher, error) {
func NewFetcher(c *Config) (*Fetcher, error) {
jar, err := cookiejar.New(nil)
if err != nil {
return nil, fmt.Errorf("failed to create a cookie jar obj: %w", err)
return nil, err
}
return &Fetcher{
@@ -45,37 +44,35 @@ func NewFetcher(conf *Config) (*Fetcher, error) {
Transport: &loggingTransport{}, // implemented in http.go
Jar: jar,
},
Config: conf,
Config: c,
Cookies: []*http.Cookie{},
},
nil
}
func (f *Fetcher) Get(uri string) (io.ReadCloser, error) {
req, err := http.NewRequest(http.MethodGet, uri, nil)
req, err := http.NewRequest("GET", uri, nil)
if err != nil {
return nil, fmt.Errorf("failed to create a new HTTP request obj: %w", err)
return nil, err
}
req.Header.Set("User-Agent", f.Config.UserAgent)
if len(f.Cookies) > 0 {
uriobj, _ := url.Parse(Baseuri)
slog.Debug("have cookies, sending them",
"sample-cookie-name", f.Cookies[0].Name,
"sample-cookie-expire", f.Cookies[0].Expires,
)
f.Client.Jar.SetCookies(uriobj, f.Cookies)
}
res, err := f.Client.Do(req)
if err != nil {
return nil, fmt.Errorf("failed to initiate HTTP request to %s: %w", uri, err)
return nil, err
}
if res.StatusCode != http.StatusOK {
if res.StatusCode != 200 {
return nil, errors.New("could not get page via HTTP")
}
@@ -88,15 +85,12 @@ func (f *Fetcher) Get(uri string) (io.ReadCloser, error) {
// fetch an image
func (f *Fetcher) Getimage(uri string) (io.ReadCloser, error) {
slog.Debug("fetching ad image", "uri", uri)
body, err := f.Get(uri)
if err != nil {
if f.Config.IgnoreErrors {
slog.Info("Failed to download image, error ignored", "error", err.Error())
return nil, nil
}
return nil, err
}

5
go.mod
View File

@@ -14,7 +14,7 @@ require (
github.com/lmittmann/tint v1.0.4
github.com/mattn/go-isatty v0.0.20
github.com/spf13/pflag v1.0.5
github.com/tlinden/yadu v0.1.2
github.com/tlinden/yadu v0.1.1
golang.org/x/sync v0.5.0
)
@@ -31,9 +31,8 @@ require (
github.com/mitchellh/reflectwalk v1.0.2 // indirect
github.com/nfnt/resize v0.0.0-20180221191011-83c6a9932646 // indirect
github.com/pelletier/go-toml v1.9.5 // indirect
github.com/pkg/errors v0.9.1 // indirect
golang.org/x/net v0.0.0-20220722155237-a158d28d115b // indirect
golang.org/x/sys v0.17.0 // indirect
golang.org/x/sys v0.14.0 // indirect
gopkg.in/yaml.v3 v3.0.1 // indirect
)

6
go.sum
View File

@@ -50,8 +50,6 @@ github.com/nfnt/resize v0.0.0-20180221191011-83c6a9932646 h1:zYyBkD/k9seD2A7fsi6
github.com/nfnt/resize v0.0.0-20180221191011-83c6a9932646/go.mod h1:jpp1/29i3P1S/RLdc7JQKbRpFeM1dOBd8T9ki5s+AY8=
github.com/pelletier/go-toml v1.9.5 h1:4yBQzkHv+7BHq2PQUZF3Mx0IYxG7LsP222s7Agd3ve8=
github.com/pelletier/go-toml v1.9.5/go.mod h1:u1nR/EPcESfeI/szUZKdtJ0xRNbUoANCkoOuaOx1Y+c=
github.com/pkg/errors v0.9.1 h1:FEBLx1zS214owpjy7qsBeixbURkuhQAwrK5UwLGTwt4=
github.com/pkg/errors v0.9.1/go.mod h1:bwawxfHBFNV+L2hUp1rHADufV3IMtnDRdf1r5NINEl0=
github.com/pmezard/go-difflib v1.0.0 h1:4DBwDE0NGyQoBHbLQYPwSUPoCMWR5BEzIk/f1lZbAQM=
github.com/pmezard/go-difflib v1.0.0/go.mod h1:iKH77koFhYxTK1pcRnkKkqfTogsbg7gZNVY4sRDYZ/4=
github.com/spf13/pflag v1.0.5 h1:iy+VFUOCP1a+8yFto/drg2CJ5u0yRoB7fZw3DKv/JXA=
@@ -66,8 +64,6 @@ github.com/tlinden/yadu v0.1.0 h1:qtCi1jxg392qVRLFyrJ2LYu6/PiKSp1LT02EX+mNLME=
github.com/tlinden/yadu v0.1.0/go.mod h1:l3bRmHKL9zGAR6pnBHY2HRPxBecf7L74BoBgOOpTcUA=
github.com/tlinden/yadu v0.1.1 h1:116oEUy9b4PcMF5wLL2dCFA/sn/praYutOnao07MROw=
github.com/tlinden/yadu v0.1.1/go.mod h1:l3bRmHKL9zGAR6pnBHY2HRPxBecf7L74BoBgOOpTcUA=
github.com/tlinden/yadu v0.1.2 h1:TYYVnUJwziRJ9YPbIbRf9ikmDw0Q8Ifixm+J/kBQFh8=
github.com/tlinden/yadu v0.1.2/go.mod h1:l3bRmHKL9zGAR6pnBHY2HRPxBecf7L74BoBgOOpTcUA=
golang.org/x/crypto v0.0.0-20190308221718-c2843e01d9a2/go.mod h1:djNgcEr1/C05ACkg1iLfiJU5Ep61QUkGW8qpdssI0+w=
golang.org/x/net v0.0.0-20180218175443-cbe0f9307d01/go.mod h1:mL1N/T3taQHkDXs73rZJwtUhF3w3ftmwwsq0BUmARs4=
golang.org/x/net v0.0.0-20181114220301-adae6a3d119a/go.mod h1:mL1N/T3taQHkDXs73rZJwtUhF3w3ftmwwsq0BUmARs4=
@@ -83,8 +79,6 @@ golang.org/x/sys v0.0.0-20220908164124-27713097b956/go.mod h1:oPkhp1MJrh7nUepCBc
golang.org/x/sys v0.6.0/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg=
golang.org/x/sys v0.14.0 h1:Vz7Qs629MkJkGyHxUlRHizWJRG2j8fbQKjELVSNhy7Q=
golang.org/x/sys v0.14.0/go.mod h1:/VUhepiaJMQUp4+oa/7Zr1D23ma6VTLIYjOOTFZPUcA=
golang.org/x/sys v0.17.0 h1:25cE3gD+tdBA7lp7QfhuV+rJiE9YXTcS3VG1SqssI/Y=
golang.org/x/sys v0.17.0/go.mod h1:/VUhepiaJMQUp4+oa/7Zr1D23ma6VTLIYjOOTFZPUcA=
golang.org/x/text v0.3.0/go.mod h1:NqM8EUOU14njkJ3fqMW+pc6Ldnwhi/IjpwHt7yyuwOQ=
gopkg.in/check.v1 v0.0.0-20161208181325-20d25e280405 h1:yhCVgyC4o1eVCa2tZl7eS0r+SDo693bJlVdllGtEeKM=
gopkg.in/check.v1 v0.0.0-20161208181325-20d25e280405/go.mod h1:Co6ibVJAznAaIkqp8huTwlJQCZ016jof/cbN4VW5Yz0=

31
http.go
View File

@@ -19,7 +19,6 @@ package main
import (
"bytes"
"fmt"
"io"
"log/slog"
"math"
@@ -33,20 +32,17 @@ import (
// easier associated in debug output
var letters = []rune("ABCDEF0123456789")
const IDLEN int = 8
// retry after HTTP 50x errors or err!=nil
const RetryCount = 3
func getid() string {
b := make([]rune, IDLEN)
b := make([]rune, 8)
for i := range b {
b[i] = letters[rand.Intn(len(letters))]
}
return string(b)
}
// retry after HTTP 50x errors or err!=nil
const RetryCount = 3
// used to inject debug log and implement retries
type loggingTransport struct{}
@@ -79,7 +75,6 @@ func drainBody(resp *http.Response) {
// unable to copy data? uff!
panic(err)
}
resp.Body.Close()
}
}
@@ -87,8 +82,8 @@ func drainBody(resp *http.Response) {
// the actual logging transport with retries
func (t *loggingTransport) RoundTrip(req *http.Request) (*http.Response, error) {
// just required for debugging
requestid := getid()
// just requred for debugging
id := getid()
// clone the request body, put into request on retry
var bodyBytes []byte
@@ -97,16 +92,16 @@ func (t *loggingTransport) RoundTrip(req *http.Request) (*http.Response, error)
req.Body = io.NopCloser(bytes.NewBuffer(bodyBytes))
}
slog.Debug("REQUEST", "id", requestid, "uri", req.URL, "host", req.Host)
slog.Debug("REQUEST", "id", id, "uri", req.URL, "host", req.Host)
// first try
resp, err := http.DefaultTransport.RoundTrip(req)
if err == nil {
slog.Debug("RESPONSE", "id", requestid, "status", resp.StatusCode,
slog.Debug("RESPONSE", "id", id, "status", resp.StatusCode,
"contentlength", resp.ContentLength)
}
// enter retry check and loop, if first req were successful, leave loop immediately
// enter retry check and loop, if first req were successfull, leave loop immediately
retries := 0
for shouldRetry(err, resp) && retries < RetryCount {
time.Sleep(backoff(retries))
@@ -123,16 +118,12 @@ func (t *loggingTransport) RoundTrip(req *http.Request) (*http.Response, error)
resp, err = http.DefaultTransport.RoundTrip(req)
if err == nil {
slog.Debug("RESPONSE", "id", requestid, "status", resp.StatusCode,
slog.Debug("RESPONSE", "id", id, "status", resp.StatusCode,
"contentlength", resp.ContentLength, "retry", retries)
}
retries++
}
if err != nil {
return resp, fmt.Errorf("failed to get HTTP response for %s: %w", req.URL, err)
}
return resp, nil
return resp, err
}

View File

@@ -19,7 +19,6 @@ package main
import (
"bytes"
"fmt"
"image/jpeg"
"log/slog"
"os"
@@ -33,15 +32,15 @@ const MaxDistance = 3
type Image struct {
Filename string
Hash *goimagehash.ImageHash
Data *bytes.Reader
URI string
Data *bytes.Buffer
Uri string
}
// used for logging to avoid printing Data
func (img *Image) LogValue() slog.Value {
return slog.GroupValue(
slog.String("filename", img.Filename),
slog.String("uri", img.URI),
slog.String("uri", img.Uri),
slog.String("hash", img.Hash.ToString()),
)
}
@@ -49,10 +48,10 @@ func (img *Image) LogValue() slog.Value {
// holds all images of an ad
type Cache []*goimagehash.ImageHash
func NewImage(buf *bytes.Reader, filename string, uri string) *Image {
func NewImage(buf *bytes.Buffer, filename string, uri string) *Image {
img := &Image{
Filename: filename,
URI: uri,
Uri: uri,
Data: buf,
}
@@ -63,12 +62,12 @@ func NewImage(buf *bytes.Reader, filename string, uri string) *Image {
func (img *Image) CalcHash() error {
jpgdata, err := jpeg.Decode(img.Data)
if err != nil {
return fmt.Errorf("failed to decode JPEG image: %w", err)
return err
}
hash1, err := goimagehash.DifferenceHash(jpgdata)
if err != nil {
return fmt.Errorf("failed to calculate diff hash of image: %w", err)
return err
}
img.Hash = hash1
@@ -81,18 +80,16 @@ func (img *Image) Similar(hash *goimagehash.ImageHash) bool {
distance, err := img.Hash.Distance(hash)
if err != nil {
slog.Debug("failed to compute diff hash distance", "error", err)
return false
}
if distance < MaxDistance {
slog.Debug("distance computation", "image-A", img.Hash.ToString(),
"image-B", hash.ToString(), "distance", distance)
return true
} else {
return false
}
return false
}
// check current image against all known hashes.
@@ -111,7 +108,7 @@ func (img *Image) SimilarExists(cache Cache) bool {
func ReadImages(addir string, dont bool) (Cache, error) {
files, err := os.ReadDir(addir)
if err != nil {
return nil, fmt.Errorf("failed to read ad directory contents: %w", err)
return nil, err
}
cache := Cache{}
@@ -125,15 +122,12 @@ func ReadImages(addir string, dont bool) (Cache, error) {
ext := filepath.Ext(file.Name())
if !file.IsDir() && (ext == ".jpg" || ext == ".jpeg" || ext == ".JPG" || ext == ".JPEG") {
filename := filepath.Join(addir, file.Name())
data, err := ReadImage(filename)
if err != nil {
return nil, err
}
reader := bytes.NewReader(data.Bytes())
img := NewImage(reader, filename, "")
img := NewImage(data, filename, "")
if err = img.CalcHash(); err != nil {
return nil, err
}
@@ -143,5 +137,6 @@ func ReadImages(addir string, dont bool) (Cache, error) {
}
}
//return nil, errors.New("ende")
return cache, nil
}

View File

@@ -133,7 +133,7 @@
.\" ========================================================================
.\"
.IX Title "KLEINGEBAECK 1"
.TH KLEINGEBAECK 1 "2024-02-10" "1" "User Commands"
.TH KLEINGEBAECK 1 "2024-01-24" "1" "User Commands"
.\" For nroff, turn off justification. Always turn off hyphenation; it makes
.\" way too many mistakes in technical documents.
.if n .ad l
@@ -152,7 +152,7 @@ kleingebaeck \- kleinanzeigen.de backup tool
\& \-l \-\-limit <num> Limit the ads to download to <num>, default: load all.
\& \-c \-\-config <file> Use config file <file> (default: ~/.kleingebaeck).
\& \-\-ignoreerrors Ignore HTTP errors, may lead to incomplete ad backup.
\& \-f \-\-force Overwrite images and ads even if the already exist.
\& \-f \-\-force Download images even if they already exist.
\& \-m \-\-manual Show manual.
\& \-h \-\-help Show usage.
\& \-V \-\-version Show program version.
@@ -182,7 +182,7 @@ Format is pretty simple:
\& template = """
\& Title: {{.Title}}
\& Price: {{.Price}}
\& Id: {{.ID}}
\& Id: {{.Id}}
\& Category: {{.Category}}
\& Condition: {{.Condition}}
\& Created: {{.Created}}
@@ -191,11 +191,11 @@ Format is pretty simple:
\& """
.Ve
.PP
Be careful if you want to change the template. The variable is a
Be carefull if you want to change the template. The variable is a
multiline string surrounded by three double quotes. You can left out
certain fields and use any formatting you like. Refer to
<https://pkg.go.dev/text/template> for details how to write a
template. Also read the \s-1TEMPLATES\s0 section below.
template.
.PP
If you're on windows and want to customize the output directory, put
it into single quotes to avoid the backslashes interpreted as escape
@@ -204,94 +204,6 @@ chars like this:
.Vb 1
\& outdir = \*(AqC:\eData\eAds\*(Aq
.Ve
.SH "TEMPLATES"
.IX Header "TEMPLATES"
Various parts of the configuration can be modified using templates:
the output directory, the ad directory and the ad listing itself.
.SS "\s-1OUTPUT DIR TEMPLATE\s0"
.IX Subsection "OUTPUT DIR TEMPLATE"
The config varialbe \f(CW\*(C`outdir\*(C'\fR or the command line parameter \f(CW\*(C`\-o\*(C'\fR take a
template which may contain:
.ie n .IP """{{.Year}}""" 4
.el .IP "\f(CW{{.Year}}\fR" 4
.IX Item "{{.Year}}"
.PD 0
.ie n .IP """{{.Month}}""" 4
.el .IP "\f(CW{{.Month}}\fR" 4
.IX Item "{{.Month}}"
.ie n .IP """{{.Day}}""" 4
.el .IP "\f(CW{{.Day}}\fR" 4
.IX Item "{{.Day}}"
.PD
.PP
That way you can create a new output directory for every backup
run. For example:
.PP
.Vb 1
\& outdir = "/home/backups/ads\-{{.Year}}\-{{.Month}}\-{{.Day}}"
.Ve
.PP
Or using the command line flag:
.PP
.Vb 1
\& \-o "/home/backups/ads\-{{.Year}}\-{{.Month}}\-{{.Day}}"
.Ve
.PP
The default value is \f(CW\*(C`.\*(C'\fR \- the current directory.
.SS "\s-1AD DIRECTORY TEMPLATE\s0"
.IX Subsection "AD DIRECTORY TEMPLATE"
The ad directory name can be modified using the following ad values:
.IP "{{.Price}}" 4
.IX Item "{{.Price}}"
.PD 0
.IP "{{.ID}}" 4
.IX Item "{{.ID}}"
.IP "{{.Category}}" 4
.IX Item "{{.Category}}"
.IP "{{.Condition}}" 4
.IX Item "{{.Condition}}"
.IP "{{.Created}}" 4
.IX Item "{{.Created}}"
.IP "{{.Slug}}" 4
.IX Item "{{.Slug}}"
.IP "{{.Text}}" 4
.IX Item "{{.Text}}"
.PD
.PP
It can only be configured in the config file. By default only
\&\f(CW\*(C`{{.Slug}}\*(C'\fR is being used, this is the title of the ad in url format.
.SS "\s-1AD TEMPLATE\s0"
.IX Subsection "AD TEMPLATE"
The ad listing itself can be modified as well, using the same
variables as the ad name template above.
.PP
This is the default template:
.PP
.Vb 7
\& Title: {{.Title}}
\& Price: {{.Price}}
\& Id: {{.ID}}
\& Category: {{.Category}}
\& Condition: {{.Condition}}
\& Created: {{.Created}}
\& Expire: {{.Expire}}
\&
\& {{.Text}}
.Ve
.PP
The config parameter to modify is \f(CW\*(C`template\*(C'\fR. See example.conf in the
source repository. Please take care, since this is a multiline
string. This is how it shall look if you modify it:
.PP
.Vb 2
\& template="""
\& Title: {{.Title}}
\&
\& {{.Text}}
\& """
.Ve
.PP
That is, the content between the two \f(CW"""\fR chars is the template.
.SH "SETUP"
.IX Header "SETUP"
To setup the tool, you need to lookup your userid on

View File

@@ -14,7 +14,7 @@ SYNOPSYS
-l --limit <num> Limit the ads to download to <num>, default: load all.
-c --config <file> Use config file <file> (default: ~/.kleingebaeck).
--ignoreerrors Ignore HTTP errors, may lead to incomplete ad backup.
-f --force Overwrite images and ads even if the already exist.
-f --force Download images even if they already exist.
-m --manual Show manual.
-h --help Show usage.
-V --version Show program version.
@@ -43,7 +43,7 @@ CONFIGURATION
template = """
Title: {{.Title}}
Price: {{.Price}}
Id: {{.ID}}
Id: {{.Id}}
Category: {{.Category}}
Condition: {{.Condition}}
Created: {{.Created}}
@@ -51,11 +51,10 @@ CONFIGURATION
{{.Text}}
"""
Be careful if you want to change the template. The variable is a
Be carefull if you want to change the template. The variable is a
multiline string surrounded by three double quotes. You can left out
certain fields and use any formatting you like. Refer to
<https://pkg.go.dev/text/template> for details how to write a template.
Also read the TEMPLATES section below.
If you're on windows and want to customize the output directory, put it
into single quotes to avoid the backslashes interpreted as escape chars
@@ -63,71 +62,6 @@ CONFIGURATION
outdir = 'C:\Data\Ads'
TEMPLATES
Various parts of the configuration can be modified using templates: the
output directory, the ad directory and the ad listing itself.
OUTPUT DIR TEMPLATE
The config varialbe "outdir" or the command line parameter "-o" take a
template which may contain:
"{{.Year}}"
"{{.Month}}"
"{{.Day}}"
That way you can create a new output directory for every backup run. For
example:
outdir = "/home/backups/ads-{{.Year}}-{{.Month}}-{{.Day}}"
Or using the command line flag:
-o "/home/backups/ads-{{.Year}}-{{.Month}}-{{.Day}}"
The default value is "." - the current directory.
AD DIRECTORY TEMPLATE
The ad directory name can be modified using the following ad values:
{{.Price}}
{{.ID}}
{{.Category}}
{{.Condition}}
{{.Created}}
{{.Slug}}
{{.Text}}
It can only be configured in the config file. By default only
"{{.Slug}}" is being used, this is the title of the ad in url format.
AD TEMPLATE
The ad listing itself can be modified as well, using the same variables
as the ad name template above.
This is the default template:
Title: {{.Title}}
Price: {{.Price}}
Id: {{.ID}}
Category: {{.Category}}
Condition: {{.Condition}}
Created: {{.Created}}
Expire: {{.Expire}}
{{.Text}}
The config parameter to modify is "template". See example.conf in the
source repository. Please take care, since this is a multiline string.
This is how it shall look if you modify it:
template="""
Title: {{.Title}}
{{.Text}}
"""
That is, the content between the two """ chars is the template.
SETUP
To setup the tool, you need to lookup your userid on kleinanzeigen.de.
Go to your ad overview page while NOT being logged in:

View File

@@ -13,7 +13,7 @@ kleingebaeck - kleinanzeigen.de backup tool
-l --limit <num> Limit the ads to download to <num>, default: load all.
-c --config <file> Use config file <file> (default: ~/.kleingebaeck).
--ignoreerrors Ignore HTTP errors, may lead to incomplete ad backup.
-f --force Overwrite images and ads even if the already exist.
-f --force Download images even if they already exist.
-m --manual Show manual.
-h --help Show usage.
-V --version Show program version.
@@ -43,7 +43,7 @@ Format is pretty simple:
template = """
Title: {{.Title}}
Price: {{.Price}}
Id: {{.ID}}
Id: {{.Id}}
Category: {{.Category}}
Condition: {{.Condition}}
Created: {{.Created}}
@@ -51,11 +51,11 @@ Format is pretty simple:
{{.Text}}
"""
Be careful if you want to change the template. The variable is a
Be carefull if you want to change the template. The variable is a
multiline string surrounded by three double quotes. You can left out
certain fields and use any formatting you like. Refer to
L<https://pkg.go.dev/text/template> for details how to write a
template. Also read the TEMPLATES section below.
template.
If you're on windows and want to customize the output directory, put
it into single quotes to avoid the backslashes interpreted as escape
@@ -63,91 +63,6 @@ chars like this:
outdir = 'C:\Data\Ads'
=head1 TEMPLATES
Various parts of the configuration can be modified using templates:
the output directory, the ad directory and the ad listing itself.
=head2 OUTPUT DIR TEMPLATE
The config varialbe C<outdir> or the command line parameter C<-o> take a
template which may contain:
=over
=item C<{{.Year}}>
=item C<{{.Month}}>
=item C<{{.Day}}>
=back
That way you can create a new output directory for every backup
run. For example:
outdir = "/home/backups/ads-{{.Year}}-{{.Month}}-{{.Day}}"
Or using the command line flag:
-o "/home/backups/ads-{{.Year}}-{{.Month}}-{{.Day}}"
The default value is C<.> - the current directory.
=head2 AD DIRECTORY TEMPLATE
The ad directory name can be modified using the following ad values:
=over
=item {{.Price}}
=item {{.ID}}
=item {{.Category}}
=item {{.Condition}}
=item {{.Created}}
=item {{.Slug}}
=item {{.Text}}
=back
It can only be configured in the config file. By default only
C<{{.Slug}}> is being used, this is the title of the ad in url format.
=head2 AD TEMPLATE
The ad listing itself can be modified as well, using the same
variables as the ad name template above.
This is the default template:
Title: {{.Title}}
Price: {{.Price}}
Id: {{.ID}}
Category: {{.Category}}
Condition: {{.Condition}}
Created: {{.Created}}
Expire: {{.Expire}}
{{.Text}}
The config parameter to modify is C<template>. See example.conf in the
source repository. Please take care, since this is a multiline
string. This is how it shall look if you modify it:
template="""
Title: {{.Title}}
{{.Text}}
"""
That is, the content between the two C<"""> chars is the template.
=head1 SETUP
To setup the tool, you need to lookup your userid on

main.go

@@ -22,8 +22,10 @@ import (
"fmt"
"io"
"log/slog"
"math/rand"
"os"
"runtime/debug"
"time"
"github.com/lmittmann/tint"
"github.com/tlinden/yadu"
@@ -35,43 +37,38 @@ func main() {
os.Exit(Main(os.Stdout))
}
func Main(output io.Writer) int {
func Main(w io.Writer) int {
logLevel := &slog.LevelVar{}
opts := &tint.Options{
Level: logLevel,
AddSource: false,
ReplaceAttr: func(groups []string, attr slog.Attr) slog.Attr {
ReplaceAttr: func(groups []string, a slog.Attr) slog.Attr {
// Remove time from the output
if attr.Key == slog.TimeKey {
if a.Key == slog.TimeKey {
return slog.Attr{}
}
return attr
return a
},
NoColor: IsNoTty(),
}
logLevel.Set(LevelNotice)
handler := tint.NewHandler(output, opts)
handler := tint.NewHandler(w, opts)
logger := slog.New(handler)
slog.SetDefault(logger)
conf, err := InitConfig(output)
conf, err := InitConfig(w)
if err != nil {
return Die(err)
}
if conf.Showversion {
fmt.Fprintf(output, "This is kleingebaeck version %s\n", VERSION)
fmt.Fprintf(w, "This is kleingebaeck version %s\n", VERSION)
return 0
}
if conf.Showhelp {
fmt.Fprintln(output, Usage)
fmt.Fprintln(w, Usage)
return 0
}
@@ -80,7 +77,6 @@ func Main(output io.Writer) int {
if err != nil {
return Die(err)
}
return 0
}
@@ -98,8 +94,7 @@ func Main(output io.Writer) int {
}
logLevel.Set(slog.LevelDebug)
handler := yadu.NewHandler(output, opts)
handler := yadu.NewHandler(w, opts)
debuglogger := slog.New(handler).With(
slog.Group("program_info",
slog.Int("pid", os.Getpid()),
@@ -112,11 +107,10 @@ func Main(output io.Writer) int {
slog.Debug("config", "conf", conf)
// prepare output dir
outdir, err := OutDirName(conf)
err = Mkdir(conf.Outdir)
if err != nil {
return Die(err)
}
conf.Outdir = outdir
// used for all HTTP requests
fetch, err := NewFetcher(conf)
@@ -124,11 +118,10 @@ func Main(output io.Writer) int {
return Die(err)
}
// setup ad dir registry, needed to check for duplicates
DirsVisited = make(map[string]int)
// randomization needed here and there
rand.Seed(time.Now().UnixNano())
switch {
case len(conf.Adlinks) >= 1:
if len(conf.Adlinks) >= 1 {
// directly backup ad listing[s]
for _, uri := range conf.Adlinks {
err := ScrapeAd(fetch, uri)
@@ -136,27 +129,25 @@ func Main(output io.Writer) int {
return Die(err)
}
}
case conf.User > 0:
} else if conf.User > 0 {
// backup all ads of the given user (via config or cmdline)
err := ScrapeUser(fetch)
if err != nil {
return Die(err)
}
default:
} else {
return Die(errors.New("invalid or no user id or no ad link specified"))
}
if conf.StatsCountAds > 0 {
adstr := "ads"
if conf.StatsCountAds == 1 {
adstr = "ad"
}
fmt.Fprintf(output, "Successfully downloaded %d %s with %d images to %s.\n",
fmt.Fprintf(w, "Successfully downloaded %d %s with %d images to %s.\n",
conf.StatsCountAds, adstr, conf.StatsCountImages, conf.Outdir)
} else {
fmt.Fprintf(output, "No ads found.")
fmt.Fprintf(w, "No ads found.")
}
return 0
@@ -164,6 +155,5 @@ func Main(output io.Writer) int {
func Die(err error) int {
slog.Error("Failure", "error", err.Error())
return 1
}

@@ -43,7 +43,7 @@ const LISTTPL string = `<!DOCTYPE html>
{{ range . }}
<h2 class="text-module-begin">
<a class="ellipsis"
href="/s-anzeige/{{ .Slug }}/{{ .ID }}">{{ .Title }}</a>
href="/s-anzeige/{{ .Slug }}/{{ .Id }}">{{ .Title }}</a>
</h2>
{{ end }}
</body>
@@ -247,7 +247,7 @@ var invalidtests = []Tests{
type AdConfig struct {
Title string
Slug string
ID string
Id string
Price string
Category string
Condition string
@@ -259,7 +259,7 @@ type AdConfig struct {
var adsrc = []AdConfig{
{
Title: "First Ad",
ID: "1", Price: "5€",
Id: "1", Price: "5€",
Category: "Klimbim",
Text: "Thing to sale",
Slug: "first-ad",
@@ -269,7 +269,7 @@ var adsrc = []AdConfig{
},
{
Title: "Secnd Ad",
ID: "2", Price: "5€",
Id: "2", Price: "5€",
Category: "Kram",
Text: "Thing to sale",
Slug: "second-ad",
@@ -279,7 +279,7 @@ var adsrc = []AdConfig{
},
{
Title: "Third Ad",
ID: "3",
Id: "3",
Price: "5€",
Category: "Kuddelmuddel",
Text: "Thing to sale",
@@ -290,7 +290,7 @@ var adsrc = []AdConfig{
},
{
Title: "Forth Ad",
ID: "4",
Id: "4",
Price: "5€",
Category: "Krempel",
Text: "Thing to sale",
@@ -301,7 +301,7 @@ var adsrc = []AdConfig{
},
{
Title: "Fifth Ad",
ID: "5",
Id: "5",
Price: "5€",
Category: "Kladderadatsch",
Text: "Thing to sale",
@@ -312,7 +312,7 @@ var adsrc = []AdConfig{
},
{
Title: "Sixth Ad",
ID: "6",
Id: "6",
Price: "5€",
Category: "Klunker",
Text: "Thing to sale",
@@ -334,17 +334,17 @@ type Adsource struct {
}
// Render a HTML template for an adlisting or an ad
func GetTemplate(adconfigs []AdConfig, adconfig AdConfig, htmltemplate string) string {
func GetTemplate(l []AdConfig, a AdConfig, htmltemplate string) string {
tmpl, err := tpl.New("template").Parse(htmltemplate)
if err != nil {
panic(err)
}
var out bytes.Buffer
if len(adconfig.ID) == 0 {
err = tmpl.Execute(&out, adconfigs)
if len(a.Id) == 0 {
err = tmpl.Execute(&out, l)
} else {
err = tmpl.Execute(&out, adconfig)
err = tmpl.Execute(&out, a)
}
if err != nil {
@@ -391,9 +391,10 @@ func InitValidSources() []Adsource {
// prepare urls for the ads
for _, ad := range adsrc {
ads = append(ads, Adsource{
uri: fmt.Sprintf("%s/s-anzeige/%s/%s", Baseuri, ad.Slug, ad.ID),
uri: fmt.Sprintf("%s/s-anzeige/%s/%s", Baseuri, ad.Slug, ad.Id),
content: GetTemplate(nil, ad, ADTPL),
})
//panic(GetTemplate(nil, ad, ADTPL))
}
return ads
@@ -446,48 +447,46 @@ func GetImage(path string) []byte {
// setup httpmock
func SetIntercept(ads []Adsource) {
headers := http.Header{}
headers.Add("Set-Cookie", "session=permanent")
ch := http.Header{}
ch.Add("Set-Cookie", "session=permanent")
for _, advertisement := range ads {
if advertisement.status == 0 {
advertisement.status = 200
for _, ad := range ads {
if ad.status == 0 {
ad.status = 200
}
httpmock.RegisterResponder("GET", advertisement.uri,
httpmock.NewStringResponder(advertisement.status, advertisement.content).HeaderAdd(headers))
httpmock.RegisterResponder("GET", ad.uri,
httpmock.NewStringResponder(ad.status, ad.content).HeaderAdd(ch))
}
// we just use 2 images, put this here
for _, image := range []string{"t/1.jpg", "t/2.jpg"} {
httpmock.RegisterResponder("GET", image,
httpmock.NewBytesResponder(200, GetImage(image)).HeaderAdd(headers))
httpmock.NewBytesResponder(200, GetImage(image)).HeaderAdd(ch))
}
}
func VerifyAd(advertisement AdConfig) error {
body := advertisement.Title + advertisement.Price + advertisement.ID + "Kleinanzeigen => " +
advertisement.Category + advertisement.Condition + advertisement.Created
func VerifyAd(ad AdConfig) error {
body := ad.Title + ad.Price + ad.Id + "Kleinanzeigen => " +
ad.Category + ad.Condition + ad.Created
// prepare ad dir name using DefaultAdNameTemplate
c := Config{Adnametemplate: "{{ .Slug }}"}
adstruct := Ad{Slug: advertisement.Slug, ID: advertisement.ID}
adstruct := Ad{Slug: ad.Slug, Id: ad.Id}
addir, err := AdDirName(&c, &adstruct)
if err != nil {
return err
}
file := fmt.Sprintf("t/out/%s/Adlisting.txt", addir)
content, err := os.ReadFile(file)
if err != nil {
return fmt.Errorf("unable to read adlisting file: %w", err)
return err
}
if body != strings.TrimSpace(string(content)) {
msg := fmt.Sprintf("ad content doesn't match.\nExpect: %s\n Got: %s\n", body, content)
return errors.New(msg)
}
@@ -505,21 +504,20 @@ func TestMain(t *testing.T) {
SetIntercept(InitValidSources())
// run commandline tests
for _, test := range tests {
for _, tt := range tests {
var buf bytes.Buffer
os.Args = strings.Split(test.args, " ")
os.Args = strings.Split(tt.args, " ")
ret := Main(&buf)
if ret != test.exitcode {
if ret != tt.exitcode {
t.Errorf("%s with cmd <%s> did not exit with %d but %d",
test.name, test.args, test.exitcode, ret)
tt.name, tt.args, tt.exitcode, ret)
}
if !strings.Contains(buf.String(), test.expect) {
if !strings.Contains(buf.String(), tt.expect) {
t.Errorf("%s with cmd <%s> output did not match.\nExpect: %s\n Got: %s\n",
test.name, test.args, test.expect, buf.String())
tt.name, tt.args, tt.expect, buf.String())
}
}
@@ -542,21 +540,20 @@ func TestMainInvalids(t *testing.T) {
SetIntercept(InitInvalidSources())
// run commandline tests
for _, test := range invalidtests {
for _, tt := range invalidtests {
var buf bytes.Buffer
os.Args = strings.Split(test.args, " ")
os.Args = strings.Split(tt.args, " ")
ret := Main(&buf)
if ret != test.exitcode {
if ret != tt.exitcode {
t.Errorf("%s with cmd <%s> did not exit with %d but %d",
test.name, test.args, test.exitcode, ret)
tt.name, tt.args, tt.exitcode, ret)
}
if !strings.Contains(buf.String(), test.expect) {
if !strings.Contains(buf.String(), tt.expect) {
t.Errorf("%s with cmd <%s> output did not match.\nExpect: %s\n Got: %s\n",
test.name, test.args, test.expect, buf.String())
tt.name, tt.args, tt.expect, buf.String())
}
}
}

@@ -22,12 +22,7 @@ freebsd/amd64
linux/amd64
netbsd/amd64
openbsd/amd64
windows/amd64
freebsd/arm64
linux/arm64
netbsd/arm64
openbsd/arm64
windows/arm64"
windows/amd64"
tool="$1"
version="$2"

scrape.go

@@ -19,10 +19,10 @@ package main
import (
"bytes"
"errors"
"fmt"
"log/slog"
"path/filepath"
"strconv"
"strings"
"time"
@@ -43,9 +43,7 @@ func ScrapeUser(fetch *Fetcher) error {
for {
var index Index
slog.Debug("fetching page", "uri", uri)
body, err := fetch.Get(uri)
if err != nil {
return err
@@ -54,7 +52,7 @@ func ScrapeUser(fetch *Fetcher) error {
err = goq.NewDecoder(body).Decode(&index)
if err != nil {
return fmt.Errorf("failed to goquery decode HTML index body: %w", err)
return err
}
if len(index.Links) == 0 {
@@ -69,16 +67,16 @@ func ScrapeUser(fetch *Fetcher) error {
}
page++
uri = baseuri + "&pageNum=" + strconv.Itoa(page)
uri = baseuri + "&pageNum=" + fmt.Sprintf("%d", page)
}
for index, adlink := range adlinks {
for i, adlink := range adlinks {
err := ScrapeAd(fetch, Baseuri+adlink)
if err != nil {
return err
}
if fetch.Config.Limit > 0 && index == fetch.Config.Limit-1 {
if fetch.Config.Limit > 0 && i == fetch.Config.Limit-1 {
break
}
}
@@ -88,20 +86,18 @@ func ScrapeUser(fetch *Fetcher) error {
// scrape an ad. uri is the full uri of the ad, dir is the basedir
func ScrapeAd(fetch *Fetcher, uri string) error {
advertisement := &Ad{}
ad := &Ad{}
// extract slug and id from uri
uriparts := strings.Split(uri, "/")
if len(uriparts) < SlugURIPartNum {
return fmt.Errorf("invalid uri: %s", uri)
if len(uriparts) < 6 {
return errors.New("invalid uri: " + uri)
}
advertisement.Slug = uriparts[4]
advertisement.ID = uriparts[5]
ad.Slug = uriparts[4]
ad.Id = uriparts[5]
// get the ad
slog.Debug("fetching ad page", "uri", uri)
body, err := fetch.Get(uri)
if err != nil {
return err
@@ -109,53 +105,36 @@ func ScrapeAd(fetch *Fetcher, uri string) error {
defer body.Close()
// extract ad contents with goquery/goq
err = goq.NewDecoder(body).Decode(&advertisement)
if err != nil {
return fmt.Errorf("failed to goquery decode HTML ad body: %w", err)
}
if len(advertisement.CategoryTree) > 0 {
advertisement.Category = strings.Join(advertisement.CategoryTree, " => ")
}
if advertisement.Incomplete() {
slog.Debug("got ad", "ad", advertisement)
return fmt.Errorf("could not extract ad data from page, got empty struct")
}
advertisement.CalculateExpire()
// prepare ad dir name
addir, err := AdDirName(fetch.Config, advertisement)
err = goq.NewDecoder(body).Decode(&ad)
if err != nil {
return err
}
proceed := CheckAdVisited(fetch.Config, addir)
if !proceed {
return nil
if len(ad.CategoryTree) > 0 {
ad.Category = strings.Join(ad.CategoryTree, " => ")
}
if ad.Incomplete() {
slog.Debug("got ad", "ad", ad)
return errors.New("could not extract ad data from page, got empty struct")
}
ad.CalculateExpire()
// write listing
err = WriteAd(fetch.Config, advertisement, addir)
addir, err := WriteAd(fetch.Config, ad)
if err != nil {
return err
}
// tell the user
slog.Debug("extracted ad listing", "ad", advertisement)
slog.Debug("extracted ad listing", "ad", ad)
// stats
fetch.Config.IncrAds()
// register for later checks
DirsVisited[addir] = 1
return ScrapeImages(fetch, advertisement, addir)
return ScrapeImages(fetch, ad, addir)
}
func ScrapeImages(fetch *Fetcher, advertisement *Ad, addir string) error {
func ScrapeImages(fetch *Fetcher, ad *Ad, addir string) error {
// fetch images
img := 1
adpath := filepath.Join(fetch.Config.Outdir, addir)
@@ -166,17 +145,16 @@ func ScrapeImages(fetch *Fetcher, advertisement *Ad, addir string) error {
return err
}
egroup := new(errgroup.Group)
g := new(errgroup.Group)
for _, imguri := range advertisement.Images {
for _, imguri := range ad.Images {
imguri := imguri
file := filepath.Join(adpath, fmt.Sprintf("%d.jpg", img))
egroup.Go(func() error {
g.Go(func() error {
// wait a little
throttle := GetThrottleTime()
time.Sleep(throttle)
t := GetThrottleTime()
time.Sleep(t)
body, err := fetch.Getimage(imguri)
if err != nil {
@@ -184,15 +162,14 @@ func ScrapeImages(fetch *Fetcher, advertisement *Ad, addir string) error {
}
buf := new(bytes.Buffer)
_, err = buf.ReadFrom(body)
if err != nil {
return fmt.Errorf("failed to read from image buffer: %w", err)
return err
}
reader := bytes.NewReader(buf.Bytes())
buf2 := buf.Bytes() // needed for image writing
image := NewImage(reader, file, imguri)
image := NewImage(buf, file, imguri)
err = image.CalcHash()
if err != nil {
return err
@@ -200,34 +177,27 @@ func ScrapeImages(fetch *Fetcher, advertisement *Ad, addir string) error {
if !fetch.Config.ForceDownload {
if image.SimilarExists(cache) {
slog.Debug("similar image exists, not written", "uri", image.URI)
slog.Debug("similar image exists, not written", "uri", image.Uri)
return nil
}
}
_, err = reader.Seek(0, 0)
if err != nil {
return fmt.Errorf("failed to seek(0) on image reader: %w", err)
}
err = WriteImage(file, reader)
err = WriteImage(file, buf2)
if err != nil {
return err
}
slog.Debug("wrote image", "image", image, "size", buf.Len(), "throttle", throttle)
slog.Debug("wrote image", "image", image, "size", len(buf2), "throttle", t)
return nil
})
img++
}
if err := egroup.Wait(); err != nil {
return fmt.Errorf("failed to finalize error waitgroup: %w", err)
if err := g.Wait(); err != nil {
return err
}
fetch.Config.IncrImgs(len(advertisement.Images))
fetch.Config.IncrImgs(len(ad.Images))
return nil
}

store.go

@@ -26,102 +26,77 @@ import (
"runtime"
"strings"
tpl "text/template"
"time"
)
type OutdirData struct {
Year, Day, Month string
}
func OutDirName(conf *Config) (string, error) {
tmpl, err := tpl.New("outdir").Parse(conf.Outdir)
func AdDirName(c *Config, ad *Ad) (string, error) {
tmpl, err := tpl.New("adname").Parse(c.Adnametemplate)
if err != nil {
return "", fmt.Errorf("failed to parse outdir template: %w", err)
return "", err
}
buf := bytes.Buffer{}
now := time.Now()
data := OutdirData{
Year: now.Format("2006"),
Month: now.Format("01"),
Day: now.Format("02"),
}
err = tmpl.Execute(&buf, data)
err = tmpl.Execute(&buf, ad)
if err != nil {
return "", fmt.Errorf("failed to execute outdir template: %w", err)
return "", err
}
return buf.String(), nil
}
func AdDirName(conf *Config, advertisement *Ad) (string, error) {
tmpl, err := tpl.New("adname").Parse(conf.Adnametemplate)
func WriteAd(c *Config, ad *Ad) (string, error) {
// prepare ad dir name
addir, err := AdDirName(c, ad)
if err != nil {
return "", fmt.Errorf("failed to parse adname template: %w", err)
return "", err
}
buf := bytes.Buffer{}
err = tmpl.Execute(&buf, advertisement)
if err != nil {
return "", fmt.Errorf("failed to execute adname template: %w", err)
}
return buf.String(), nil
}
func WriteAd(conf *Config, advertisement *Ad, addir string) error {
// prepare output dir
dir := filepath.Join(conf.Outdir, addir)
err := Mkdir(dir)
dir := filepath.Join(c.Outdir, addir)
err = Mkdir(dir)
if err != nil {
return err
return "", err
}
// write ad file
listingfile := filepath.Join(dir, "Adlisting.txt")
listingfd, err := os.Create(listingfile)
f, err := os.Create(listingfile)
if err != nil {
return fmt.Errorf("failed to create Adlisting.txt: %w", err)
return "", err
}
defer listingfd.Close()
defer f.Close()
if runtime.GOOS == WIN {
advertisement.Text = strings.ReplaceAll(advertisement.Text, "<br/>", "\r\n")
if runtime.GOOS == "windows" {
ad.Text = strings.ReplaceAll(ad.Text, "<br/>", "\r\n")
} else {
advertisement.Text = strings.ReplaceAll(advertisement.Text, "<br/>", "\n")
ad.Text = strings.ReplaceAll(ad.Text, "<br/>", "\n")
}
tmpl, err := tpl.New("adlisting").Parse(conf.Template)
tmpl, err := tpl.New("adlisting").Parse(c.Template)
if err != nil {
return fmt.Errorf("failed to parse adlisting template: %w", err)
return "", err
}
err = tmpl.Execute(listingfd, advertisement)
err = tmpl.Execute(f, ad)
if err != nil {
return fmt.Errorf("failed to execute adlisting template: %w", err)
return "", err
}
slog.Info("wrote ad listing", "listingfile", listingfile)
return nil
return addir, nil
}
func WriteImage(filename string, reader *bytes.Reader) error {
func WriteImage(filename string, buf []byte) error {
file, err := os.Create(filename)
if err != nil {
return fmt.Errorf("failed to open image file: %w", err)
return err
}
defer file.Close()
_, err = reader.WriteTo(file)
_, err = file.Write(buf)
if err != nil {
return fmt.Errorf("failed to write to image file: %w", err)
return err
}
return nil
@@ -136,12 +111,12 @@ func ReadImage(filename string) (*bytes.Buffer, error) {
data, err := os.ReadFile(filename)
if err != nil {
return nil, fmt.Errorf("failed to read image file: %w", err)
return nil, err
}
_, err = buf.Write(data)
if err != nil {
return nil, fmt.Errorf("failed to write image into buffer: %w", err)
return nil, err
}
return &buf, nil
@@ -152,24 +127,5 @@ func fileExists(filename string) bool {
if os.IsNotExist(err) {
return false
}
return !info.IsDir()
}
// check if an addir has already been processed by current run and
// decide what to do
func CheckAdVisited(conf *Config, adname string) bool {
if Exists(DirsVisited, adname) {
if conf.ForceDownload {
slog.Warn("an ad with the same name has already been downloaded, overwriting", "addir", adname)
return true
}
// don't overwrite
slog.Warn("an ad with the same name has already been downloaded, skipping (use -f to overwrite)", "addir", adname)
return false
}
// overwrite
return true
}

@@ -18,7 +18,6 @@ along with this program. If not, see <http://www.gnu.org/licenses/>.
package main
import (
"bytes"
"testing"
)
@@ -27,13 +26,12 @@ import (
// doesn't show up in the coverage report for unknown reasons, so
// here's a single test for it
func TestWriteImage(t *testing.T) {
t.Parallel()
reader := bytes.NewReader([]byte{1, 2, 3, 4, 5, 6, 7, 8})
buf := []byte{1, 2, 3, 4, 5, 6, 7, 8}
file := "t/out/t.jpg"
err := WriteImage(file, reader)
err := WriteImage(file, buf)
if err != nil {
t.Errorf("Could not write mock image to %s: %s", file, err.Error())
}
}

@@ -1,6 +1,6 @@
# empty config for Main() unit tests to force unit tests NOT to use an
# eventually existing ~/.kleingebaeck!
template="""
{{.Title}}{{.Price}}{{.ID}}{{.Category}}{{.Condition}}{{.Created}}
{{.Title}}{{.Price}}{{.Id}}{{.Category}}{{.Condition}}{{.Created}}
"""

@@ -2,5 +2,5 @@ user = 1
loglevel = "verbose"
outdir = "t/out"
template="""
{{.Title}}{{.Price}}{{.ID}}{{.Category}}{{.Condition}}{{.Created}}
{{.Title}}{{.Price}}{{.Id}}{{.Category}}{{.Condition}}{{.Created}}
"""

@@ -1,7 +1,5 @@
#!/bin/sh -x
base="../kleinanzeigen"
rm -rf $base
mkdir -p $base
echo "Generating /s-bestandsliste.html"

util.go

@@ -1,5 +1,5 @@
/*
Copyright © 2023-2024 Thomas von Dein
Copyright © 2023 Thomas von Dein
This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
@@ -20,7 +20,6 @@ package main
import (
"bytes"
"errors"
"fmt"
"math/rand"
"os"
"os/exec"
@@ -32,9 +31,9 @@ import (
func Mkdir(dir string) error {
if _, err := os.Stat(dir); errors.Is(err, os.ErrNotExist) {
err := os.MkdirAll(dir, os.ModePerm)
err := os.Mkdir(dir, os.ModePerm)
if err != nil {
return fmt.Errorf("failed to create directory %s: %w", dir, err)
return err
}
}
@@ -45,8 +44,7 @@ func man() error {
man := exec.Command("less", "-")
var b bytes.Buffer
b.WriteString(manpage)
b.Write([]byte(manpage))
man.Stdout = os.Stdout
man.Stdin = &b
@@ -55,7 +53,7 @@ func man() error {
err := man.Run()
if err != nil {
return fmt.Errorf("failed to execute 'less': %w", err)
return err
}
return nil
@@ -63,7 +61,7 @@ func man() error {
// returns TRUE if stdout is NOT a tty or windows
func IsNoTty() bool {
if runtime.GOOS == WIN || !isatty.IsTerminal(os.Stdout.Fd()) {
if runtime.GOOS == "windows" || !isatty.IsTerminal(os.Stdout.Fd()) {
return true
}
@@ -74,12 +72,3 @@ func IsNoTty() bool {
func GetThrottleTime() time.Duration {
return time.Duration(rand.Intn(MaxThrottle-MinThrottle+1)+MinThrottle) * time.Millisecond
}
// look if a key in a map exists, generic variant
func Exists[K comparable, V any](m map[K]V, v K) bool {
if _, ok := m[v]; ok {
return true
}
return false
}