When looking at a non-trivial codebase for the first time, it's very difficult to understand any of it until you get the "Big Picture". This page is meant to, hopefully, give you dupeGuru's big picture.

Model/View/Controller... nope!

dupeGuru's codebase has quite a few design flaws. The Model, View and Controller roles are filled by different classes, scattered around. If you're aware of that, it might help you to understand what the heck is going on.

The central piece of dupeGuru is dupeguru.app.DupeGuru (in the py code). It's the GUI code's only interface to the Python code. A duplicate scan is started with start_scanning(), directories are added through add_directory(), etc.

A lot of the App's functionality is implemented in the platform-specific subclasses of app.DupeGuru, like app_cocoa.DupeGuru, or the base.app.DupeGuru class in the PyQt codebase. For example, when performing "Remove Selected From Results", app_cocoa.DupeGuru.RemoveSelected() on the Obj-C side and base.app.DupeGuru.remove_duplicates() on the PyQt side are called, respectively, to do the work. All of this is quite ugly, I know (see the "Refactoring" section below).


A lot of operations in dupeGuru take a significant amount of time. This is why there's a generalized threaded job mechanism built into app.DupeGuru. First, app.DupeGuru has a progress member which is an instance of hsutil.job.ThreadedJobPerformer. It lets the GUI code know the progress of the current threaded job. When app.DupeGuru needs to start a job, it calls _start_job() and the platform-specific subclass deals with the details of starting the job.
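
To make the division of labor concrete, here's a rough sketch of the pattern described above. It is not dupeGuru's actual code: JobProgress, BaseApp and ThreadedApp are hypothetical stand-ins (JobProgress only loosely plays the role of hsutil.job.ThreadedJobPerformer, whose real API is not shown here), and the "scan" just walks a hard-coded list.

```python
import threading


class JobProgress:
    """Hypothetical stand-in for the progress object the GUI reads."""

    def __init__(self):
        self._lock = threading.Lock()
        self.percent = 0
        self.description = ""

    def update(self, percent, description=""):
        with self._lock:
            self.percent = percent
            self.description = description


class BaseApp:
    """Platform-independent side: decides what to run, not how to run it."""

    def __init__(self):
        self.progress = JobProgress()
        self.scanned = []

    def _start_job(self, jobid, func):
        # Each platform-specific subclass supplies its own way to launch a job.
        raise NotImplementedError()

    def start_scanning(self):
        def do(progress):
            # Pretend work: four items, 25% each.
            for i, name in enumerate(["a", "b", "c", "d"], start=1):
                self.scanned.append(name)
                progress.update(i * 25, "Scanning %s" % name)

        self._start_job("scan", do)


class ThreadedApp(BaseApp):
    """Hypothetical platform subclass that runs jobs in a worker thread."""

    def _start_job(self, jobid, func):
        self.last_jobid = jobid
        self.worker = threading.Thread(target=func, args=(self.progress,))
        self.worker.start()


app = ThreadedApp()
app.start_scanning()
app.worker.join()  # a real GUI would poll app.progress from a timer instead
print(app.progress.percent, app.progress.description)
```

The base class never touches threads itself, so each GUI toolkit is free to launch and monitor the job in whatever way suits it.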

Core principles

The core of the duplicate matching takes place (for SE and ME, not PE) in dupeguru.engine. There's MatchFactory.getmatches(), which takes a list of hsfs.File instances and returns a list of (firstfile, secondfile, match_percentage) matches. Then, there's get_groups(), which takes a list of matches and returns a list of Group instances (a Group is basically a list of hsfs.File instances that match each other).
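
The two-step flow (pairs first, groups second) can be sketched roughly as follows. This is a toy under loud assumptions: plain strings stand in for hsfs.File instances, the word-overlap percentage is made up for illustration, and the grouping step simply merges connected pairs with a union-find, which is a swapped-in simplification and not necessarily how dupeguru.engine builds its Group instances.

```python
from itertools import combinations


def getmatches(names, min_percentage=50):
    """Return (first, second, percentage) for each pair sharing enough words."""
    matches = []
    for first, second in combinations(names, 2):
        words1 = set(first.lower().split())
        words2 = set(second.lower().split())
        total = len(words1 | words2)
        percentage = 100 * len(words1 & words2) // total if total else 0
        if percentage >= min_percentage:
            matches.append((first, second, percentage))
    return matches


def get_groups(matches):
    """Merge matched pairs into groups (simplified: connected components)."""
    parent = {}

    def find(item):
        parent.setdefault(item, item)
        while parent[item] != item:
            parent[item] = parent[parent[item]]  # path halving
            item = parent[item]
        return item

    for first, second, _ in matches:
        parent[find(first)] = find(second)

    groups = {}
    for item in list(parent):
        groups.setdefault(find(item), []).append(item)
    return [sorted(members) for members in groups.values()]


names = ["foo bar", "bar foo", "foo bar baz", "unrelated song"]
matches = getmatches(names)
groups = get_groups(matches)
print(matches)
print(groups)
```

Note that "unrelated song" matches nothing, so it never shows up in a group at all.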

When a scan is over, the final result (the list of groups from get_groups()) is placed into app.DupeGuru.results, which is a results.Results instance. The Results instance is where all the dupe marking, sorting, removing, power marking, etc. takes place.
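
As a rough picture of what "marking and removing" means on such an object, here's a minimal sketch. It is not the real results.Results class: the method names, and the assumption that the first file of each group is the reference that can't be marked, are hypothetical choices made for the example.

```python
class Results:
    """Toy results holder: a list of groups plus a set of marked dupes."""

    def __init__(self, groups):
        self.groups = [list(group) for group in groups]
        self.marked = set()

    @property
    def dupes(self):
        # Assumption: the first file of each group is the reference, the rest are dupes.
        return [f for group in self.groups for f in group[1:]]

    def mark(self, dupe):
        # Only dupes can be marked; references are silently ignored.
        if dupe in self.dupes:
            self.marked.add(dupe)

    def remove_marked(self):
        for group in self.groups:
            group[:] = [f for f in group if f not in self.marked]
        # A group with a single file left is no longer a group of duplicates.
        self.groups = [g for g in self.groups if len(g) > 1]
        self.marked.clear()


results = Results([["a", "b", "c"], ["x", "y"]])
results.mark("a")  # ignored: "a" is a reference
results.mark("b")
results.mark("y")
results.remove_marked()
print(results.groups)
```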


Refactoring

As I mentioned at the beginning of the page, quite a few design mistakes have been made during the development of dupeGuru. One could argue that a huge refactoring of the codebase should be done all at once, and be over with. The problem is that huge refactorings are error-prone, especially with weak unit test coverage. Also, dupeGuru's development is not as active as it used to be. Sure, there are still features to be implemented, but nothing major (except the recent dupeGuru PE cython/multiprocessing improvement). The approach I want to take on this is the "slowly but surely" approach. So, how it works is that when you're about to work on a piece of code that needs refactoring, you do the refactoring first. Until you need to work on that piece of code, leave it alone. Here's a list of ongoing refactorings:

Obj-C's dgbase merge. When I created the different dupeGuru editions, I made the awful mistake of copy/pasting the whole Obj-C code, then just modifying what needed it. I know that was stupid, but I did it anyway. Then, a while after, I created the dgbase project which contains Obj-C code common to all editions. Instead of moving it all at once, which would have been error-prone, I just slowly push code down to dgbase when appropriate. Therefore, whenever a piece of Obj-C code is about to be modified, if it's common to all editions, it has to be moved down to dgbase first. No exceptions. If you copy/paste your modification 3 times, it means you're doing something wrong.

PEP8. There are still some CamelCaseMethods lying around. When working near one of them, just change it to lowercase_with_underscore() (don't forget the project-wide search/replace).

Platform-independent code in platform-specific units. Some behavior in dupeGuru is defined by code in the platform-specific units, but is in fact platform-independent behavior. This is actually pretty tricky to refactor, because we're not dealing with clear-cut code duplication here. Pushing that behavior down to platform-independent units usually involves building an override mechanism so that each platform can still plug in its own parts.

Placing specific and common code where they belong. Some code is not in the right place. For example, app_cocoa is not supposed to be in dupeguru, which is platform-independent code. But there's also platform-independent-but-edition-specific code, like dupeguru.picture. Although this unit is platform-independent, it is being checked out with dupeGuru ME and dupeGuru SE. This normally shouldn't be so. However, this kind of refactoring is tricky to do, and I'm not exactly sure how the code should be arranged for everything to be in the correct place. This has to be thought out.

PyQt camelCase. My first experience with PyQt was porting dupeGuru's .NET code to PyQt. At first, I used underscore_method_names(), but later, I decided I'd switch to camelCase() for PyQt code to blend in more with Qt's style. The result is that there's an ongoing refactoring changing underscore_method() names to camelCase() ones.
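
To illustrate the "Platform-independent code in platform-specific units" item, here's a minimal sketch of one common kind of override mechanism: the shared behavior lives once in the base class, and the platform subclass only overrides small hooks. All names (BaseApp, FakeGuiApp, _confirm(), _refresh_view()) are hypothetical and chosen for the example; this is one possible shape, not how dupeGuru's units are actually arranged.

```python
class BaseApp:
    """Platform-independent unit: owns the behavior shared by all GUIs."""

    def __init__(self, dupes):
        self.dupes = list(dupes)

    def remove_duplicates(self, selected):
        # Shared flow: confirm, remove, then let the GUI refresh itself.
        if not selected or not self._confirm(len(selected)):
            return False
        self.dupes = [d for d in self.dupes if d not in selected]
        self._refresh_view()
        return True

    # Hooks: safe defaults that a platform-specific subclass may override.
    def _confirm(self, count):
        return True

    def _refresh_view(self):
        pass


class FakeGuiApp(BaseApp):
    """Stand-in for a platform-specific subclass (Cocoa, PyQt, ...)."""

    def __init__(self, dupes):
        super().__init__(dupes)
        self.refresh_count = 0

    def _confirm(self, count):
        # Pretend the user refuses any removal of three or more files.
        return count < 3

    def _refresh_view(self):
        self.refresh_count += 1


app = FakeGuiApp(["a", "b", "c", "d"])
first = app.remove_duplicates(["a"])
second = app.remove_duplicates(["b", "c", "d"])
print(first, second, app.dupes, app.refresh_count)
```

The tricky part in practice is deciding which steps become hooks, since the existing platform-specific code rarely splits that cleanly.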
