When looking at a non-trivial codebase for the first time, it's very difficult to make sense of any of it until you get the "Big Picture". This page is meant to give you dupeGuru's big picture.
Model/View/Controller... nope!
dupeGuru's codebase has quite a few design flaws. The Model, View and Controller roles are filled by different classes, scattered around. If you're aware of that, it might help you to understand what the heck is going on.
The central piece of dupeGuru is dupeguru.app.DupeGuru (in the py code). It's the only interface to the Python code for the GUI code. A duplicate scan is started with start_scanning(), directories are added through add_directory(), etc.
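As a rough sketch of that single-entry-point role, the snippet below uses a stand-in class, not the real dupeguru.app.DupeGuru. Only the method names start_scanning() and add_directory() come from the text above; the constructor, the directories and scanning attributes, and the error raised are assumptions made for illustration.

```python
class DupeGuru:
    """Hypothetical stand-in for dupeguru.app.DupeGuru (not the real class)."""

    def __init__(self):
        self.directories = []   # assumed attribute: folders queued for scanning
        self.scanning = False   # assumed flag, only used by this sketch

    def add_directory(self, path):
        # The GUI hands over a path; the app object owns the list.
        if path not in self.directories:
            self.directories.append(path)

    def start_scanning(self):
        # The real method presumably starts a job; here we only flip a flag.
        if not self.directories:
            raise ValueError("no directory to scan")
        self.scanning = True


# GUI-side code only ever talks to this one object.
app = DupeGuru()
app.add_directory("/tmp/music")
app.add_directory("/tmp/photos")
app.start_scanning()
```

The point of the sketch is the shape of the boundary: the GUI layer holds one app object and calls methods on it, and never reaches into the modules behind it.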
A lot of the app's functionality is implemented in the platform-specific subclasses of app.DupeGuru, like app_cocoa.DupeGuru, or the base.app.DupeGuru class in the PyQt codebase. For example, when performing "Remove Selected From Results", app_cocoa.DupeGuru.RemoveSelected() on the Obj-C side and base.app.DupeGuru.remove_duplicates() on the PyQt side are respectively called to do the work. All of this is quite ugly, I know (see the "Refactoring" section below).
Jobs
A lot of operations in dupeGuru take a significant amount of time. This is why there's a generalized threaded job mechanism built into app.DupeGuru. First, app.DupeGuru has a progress member, which is an instance of hsutil.job.ThreadedJobPerformer. It lets the GUI code know the progress of the current threaded job. When app.DupeGuru needs to start a job, it calls _start_job() and the platform-specific subclass deals with the details of starting the job.
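The division of labour described above can be sketched as follows. This is a simplified illustration, not the actual hsutil.job API. JobPerformer, its progress and done attributes, the update callback and the QtDupeGuru subclass are hypothetical names, and a plain threading.Thread stands in for whatever each platform really does.

```python
import threading


class JobPerformer:
    """Hypothetical stand-in for hsutil.job.ThreadedJobPerformer."""

    def __init__(self):
        self.progress = 0      # percentage the GUI can poll
        self.done = False
        self._thread = None

    def run_threaded(self, func):
        def target():
            func(self._update)
            self.done = True
        self._thread = threading.Thread(target=target)
        self._thread.start()

    def _update(self, percent):
        self.progress = percent

    def wait(self):
        self._thread.join()


class DupeGuru:
    """Platform-independent base: knows what to run, not how to run it."""

    def __init__(self):
        self.progress = JobPerformer()
        self.scanned = []

    def start_scanning(self):
        self._start_job(self._do_scan)

    def _do_scan(self, update):
        # Pretend work: four items, reporting 25% after each one.
        for i, name in enumerate(["a", "b", "c", "d"], start=1):
            self.scanned.append(name)
            update(i * 25)

    def _start_job(self, func):
        raise NotImplementedError  # supplied by the platform-specific subclass


class QtDupeGuru(DupeGuru):
    """Hypothetical platform subclass: decides how the job is started."""

    def _start_job(self, func):
        # A real GUI would also show a progress dialog here.
        self.progress.run_threaded(func)


app = QtDupeGuru()
app.start_scanning()
app.progress.wait()
```

The base class never touches threads directly, so each platform stays free to start and display jobs in its own way.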
Core principles
The core of the duplicate matching takes place (for SE and ME, not PE) in dupeguru.engine. There's MatchFactory.getmatches(), which takes a list of hsfs.File instances and returns a list of (firstfile, secondfile, match_percentage) matches. Then, there's get_groups(), which takes a list of matches and returns a list of Group instances (a Group is basically a list of hsfs.File instances that match each other).
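To make the two-step pipeline concrete, here is a toy sketch. It does not reproduce dupeguru.engine: files are plain name strings rather than hsfs.File instances, the word-overlap score is an invented similarity measure, groups are plain lists rather than Group instances, the min_percentage parameter is an assumption, and merging every connected match into one group is a simplification that may differ from the real grouping rules.

```python
from itertools import combinations


def getmatches(names, min_percentage=50):
    """Return (first, second, match_percentage) tuples for similar names."""
    matches = []
    for first, second in combinations(names, 2):
        words1 = set(first.lower().split())
        words2 = set(second.lower().split())
        union = words1 | words2
        if not union:
            continue
        # Invented score: shared words over all words, as a percentage.
        percentage = 100 * len(words1 & words2) // len(union)
        if percentage >= min_percentage:
            matches.append((first, second, percentage))
    return matches


def get_groups(matches):
    """Merge pairwise matches into groups of names that match together."""
    groups = []
    for first, second, _ in matches:
        # Find the existing groups that either name already belongs to.
        touched = [g for g in groups if first in g or second in g]
        merged = [first, second]
        for g in touched:
            groups.remove(g)
            merged.extend(g)
        # dict.fromkeys removes duplicates while keeping order.
        groups.append(list(dict.fromkeys(merged)))
    return groups


names = ["summer holiday", "holiday summer", "summer holiday copy", "tax report"]
matches = getmatches(names)
groups = get_groups(matches)
```

With these inputs, the three "summer holiday" variants end up in a single group and "tax report" is left out because it matches nothing.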
When a scan is over, the final result (the list of groups from get_groups()) is placed into app.DupeGuru.results, which is a results.Results instance. The Results instance is where all the dupe marking, sorting, removing, power marking, etc. take place.
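The role of the results object can be pictured with the following sketch. It is not the real results.Results class. The mark() and remove_marked() method names, the dupes property, and the convention that the first file in a group is the reference to keep are all assumptions made for illustration.

```python
class Results:
    """Hypothetical stand-in for results.Results."""

    def __init__(self, groups):
        self.groups = [list(g) for g in groups]
        self.marked = set()

    @property
    def dupes(self):
        # Assumption: the first file of each group is kept as the reference.
        return [f for g in self.groups for f in g[1:]]

    def mark(self, dupe):
        # Only dupes can be marked; references are silently ignored.
        if dupe in self.dupes:
            self.marked.add(dupe)

    def remove_marked(self):
        # Drop marked files, then discard groups left with a single file.
        self.groups = [[f for f in g if f not in self.marked] for g in self.groups]
        self.groups = [g for g in self.groups if len(g) > 1]
        self.marked.clear()


results = Results([["a.mp3", "a (copy).mp3", "a (2).mp3"], ["b.jpg", "b2.jpg"]])
results.mark("a (copy).mp3")
results.mark("b2.jpg")
results.mark("a.mp3")      # ignored: references cannot be marked in this sketch
results.remove_marked()
```

Keeping marking and removal on one object means the GUI only has to ask the results for their current state after each operation.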
Refactoring
As I mentioned at the beginning of the page, quite a few design mistakes have been made during the development of dupeGuru. One could argue that there should be a huge refactoring work done on the codebase at once, and then be done with it. The problem is that huge refactorings are error-prone, especially with weak unit test coverage. Also, dupeGuru's development is not as active as it used to be. Sure, there are still features to be implemented, but nothing major (except the recent dupeGuru PE cython/multiprocessing improvement). The approach I want to take on this is the "slowly but surely" approach. The way it works is that when you're about to work on a piece of code that needs refactoring, you do the refactoring. Until you need to work on that piece of code, leave it alone. Here's a list of ongoing refactorings:
Obj-C's dgbase merge. When I created the different dupeGuru editions, I made the awful mistake of copy/pasting the whole Obj-C code, then just modifying what needed it. I know that was stupid, but I did it anyway. Then, a while after, I created the dgbase project, which contains Obj-C code common to all editions. Instead of moving it all at once, which would have been error-prone, I just slowly push code down to dgbase when appropriate. Therefore, whenever a piece of Obj-C code is about to be modified, if it's common to all editions, it has to be moved down to dgbase first. No exceptions. If you copy/paste your modification 3 times, it means you're doing something wrong.
PEP8. There are still some CamelCaseMethods lying around. When working near one of them, just change them to lowercase_with_underscore() (don't forget the project-wide search/replace).
Platform-independent code in platform-specific units. Some behavior in dupeGuru is defined by code in the platform-specific units, but is in fact platform-independent behavior. This is actually pretty tricky to refactor, because we're not dealing with clear-cut code duplication here. Pushing that behavior down to platform-independent units usually involves building an override mechanism and stuff like that.
Placing specific and common code where they belong. Some code is not in the right place. For example, app_cocoa is not supposed to be in dupeguru, which is platform-independent code. But there's also platform-independent-but-edition-specific code, like dupeguru.picture. Although this unit is platform-independent, it is being checked out with dupeGuru ME and dupeGuru SE. This normally shouldn't be so. However, this kind of refactoring is tricky to do, and I'm not exactly sure how the code should be arranged for everything to be in the correct place. This has to be thought out.
PyQt camelCase. My first experience with PyQt was by porting dupeGuru's .NET code to PyQt. At first, I used underscore_method_names(), but later, I decided I'd switch to camelCase() for PyQt code to blend in more with Qt's style. The result is that there's an ongoing refactoring changing underscore_method() to camelCase methods.
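One possible shape for the override mechanism mentioned in the "Platform-independent code in platform-specific units" item is a template method: the shared behavior lives in the platform-independent class, which calls a small hook that each platform may override. The class names, the remove_selected() method and the _confirm() hook below are hypothetical and only illustrate the idea.

```python
class DupeGuru:
    """Platform-independent unit: owns the behavior shared by all GUIs."""

    def __init__(self, dupes):
        self.dupes = list(dupes)

    def remove_selected(self, selected):
        # Shared logic, written once instead of once per platform.
        to_remove = [d for d in selected if d in self.dupes]
        if to_remove and self._confirm(len(to_remove)):
            self.dupes = [d for d in self.dupes if d not in to_remove]
            return len(to_remove)
        return 0

    def _confirm(self, count):
        # Hook: the default accepts; platforms override it to ask the user.
        return True


class CocoaDupeGuru(DupeGuru):
    """Hypothetical platform subclass overriding only the hook."""

    def _confirm(self, count):
        # A real subclass would show a native dialog; this one refuses
        # large removals just to demonstrate the override.
        return count < 3


base = DupeGuru(["a", "b", "c"])
removed_base = base.remove_selected(["a", "b", "c"])

cocoa = CocoaDupeGuru(["a", "b", "c"])
removed_cocoa = cocoa.remove_selected(["a", "b", "c"])
```

The platform subclass shrinks to the one decision that really is platform-specific, while the filtering and removal logic stays in a single place.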