- 7/17/07, 3:30: So, I've been working on the atmospheric
code for about 4 days now. The first few were mostly just looking
through the code trying to figure out what in the world it did. I
also discovered I hadn't learned as much Fortran 90 as I thought
(there were a lot of weird constructs in the code I didn't recognize
at all), so I did some Googling of various pieces of it as well. Now,
though, I'm actually comfortable enough with it to start some basic
timings of the code to see where the bottlenecks are. Right now the
compiler (we have to use a trial Intel compiler to get it to work
correctly) is a little weird and it will only compile on 32-bit
machines, but we're getting there. But, at least the code RUNS! I'm
amazed we got it even running this quickly.
- 7/11/07, 11:50: It's done! I have all the data collected
and the final report done for the toy problem (check it out here)!
So, I think I can start
on the *real* atmospheric science project now :).
- 7/10/07, 5:30: Heh heh....all those weird issues with the
timing? They're ALL from one wrong type, in the initial subroutine,
at that. The initial input is NOT a real(kind=8), it's just a real
(which is only 4 bytes). However, in the subroutine, I'd typed it as
an 8-byte, and somehow that screwed everything up. I can see it
eating the values going in (they all became 0s) but WHY did it make
everything run about 10x slower? How would it even affect that? And
why, in the serial version, did it *only* affect it if I printed out
the variable? Whatever, it seems to work now, so all I have to do is
finish up the final report, and I'm done with the toy project!
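To see how a 4-byte real read as an 8-byte one can turn into (almost) 0, here is a small Python sketch. It is not the Fortran code from the project: the value 0.01, the little-endian layout, and the assumption that the four bytes following the real happen to be zero are all made up for illustration.

```python
import struct

def misread_real4_as_real8(value, padding=b"\x00\x00\x00\x00"):
    """Store `value` as a 4-byte real, then read 8 bytes back as a double.

    `padding` stands in for whatever bytes sit after the real in memory;
    zeros are an assumption made for this illustration.
    """
    four_bytes = struct.pack("<f", value)            # like a plain `real`
    (wrong,) = struct.unpack("<d", four_bytes + padding)  # like real(kind=8)
    return wrong

# What the caller actually stored (0.01 rounded to single precision).
(stored,) = struct.unpack("<f", struct.pack("<f", 0.01))
# What a routine expecting 8 bytes would see instead.
misread = misread_real4_as_real8(0.01)

print("stored as real(4):", stored)
print("read as real(8): ", misread)
```

With zero padding, the misread value is a subnormal number, far too small to tell apart from 0 in most printouts. Arithmetic on subnormals is slow on many CPUs, which might be related to the slowdown, but that is only a guess.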
- 7/9/07, 3:00: So, it turns out that a lot of the times
aren't coming out at all like I'd expect. The latest set of data has
the parallel version on a single 64-bit computer taking almost 400
seconds, and all three of the latest versions take a lot longer than
they did before, and all I did was change the timing mechanism from
gettimeofday (or its Fortran binding) to MPI's WTIME, which
theoretically should NOT change how long it actually takes to run
(quotations about the act of measuring something changing it aside).
Also, it's odd that my dual-core machine is *so* much faster than the
single-cores, but Michelle says that's probably a difference in cache
sizes, and I know my machine has more memory than the single-cores, so
it having a bigger cache wouldn't surprise me. The other issue,
though, is that the parallel times are incredibly short (in the range
of the *old* times I was getting, though, so they in and of themselves
are quite reasonable). Michelle brought up the point that sometimes
compilers will optimize out the vast majority of your program if it's
not relevant to the output (!) but that doesn't *seem* to be happening
here....I'm not really sure what's going on, but to me, it seems like
the *new* times are wrong, they just don't fit everything else.
- 7/6/07, 2:00: CoGrid finally works!! It turns out, after
all that, that the issue was the types of variables I was using (real
vs. real(KIND=8), vs. double precision). Once I got that all
straightened out, it was happy; it really WAS dividing by 0 (I *think*
the dt and dx values were getting corrupted by other variables having
the wrong datatype, and then were being treated as 0). So, finally, I
have real CoGrid times!!
- 7/5/07, 12:30: I finally have the report done! (or at
least a first draft of it). Check it out here!
Once I got the data recollected with the *correct* timing, it was
pretty easy to finish (I had most of the writing done already). So
now, I just have to figure out how to get CoGrid working, then try
NCAR, and then I get to start working on the *real* problem!
- 7/3/07, 2:15: UPDATE: It turns out that there wasn't
actually anything wrong with the data from the Linux computer
runs....I just was measuring the wrong thing! I was measuring CPU
time, and really I needed total execution time (wall clock). That's
because if you only measure CPU time, then as you add more processes,
by definition each one will take less CPU time...but since the CPU
itself isn't really parallel, they have to be swapped in and out to
all run in "parallel", which slows it down quite a bit. I was
actually effectively measuring speedup/CPU MHz, rather than
speedup/number of processes! Oh well, at least now I know!
For the last several days I've been fighting
with several different problems. One is that CoGrid still won't
work...the latest is some kind of strange socket error when the last
node tries to receive from the second-to-last node, but then dies for
some reason. At least it compiles and tries to run now....but I have
no idea why it won't connect to itself correctly when it can on the
Linux machines in the department with no problem. *sigh*...
The other odd problem is with my 7-processor and 1-processor sets of
data. They're virtually identical. And they really shouldn't be.
The single-processor run should be a lot slower than the 7-processor
run, and a little slower than the 2-processor run, but instead they're
both faster than the 2-processor one. As far as I can tell,
everything is running where it's supposed to be, so we don't have a
clue why the times are so close (within hundredths of a second or so). The
problem is, it's hard to analyze data when the data you're getting is
fairly obviously flawed somehow....oh well, other than that, my
report's basically done.
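The CPU-time versus wall-clock mix-up from the update at the top of this entry can be reproduced in a few lines. This is a minimal Python sketch, not the timing code used for the runs: `measure` and `mostly_waiting` are made-up names, and the 0.2-second sleep stands in for a process that is waiting (swapped out, or blocked on a message) rather than computing.

```python
import time

def measure(work):
    """Run `work` and return (wall-clock seconds, CPU seconds)."""
    wall_start = time.perf_counter()   # elapsed real time
    cpu_start = time.process_time()    # CPU time used by this process only
    work()
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return wall, cpu

def mostly_waiting():
    # Sleeping uses almost no CPU but still takes real time.
    time.sleep(0.2)

wall, cpu = measure(mostly_waiting)
print(f"wall clock: {wall:.3f} s, CPU: {cpu:.3f} s")
```

The CPU figure comes out far smaller than the wall-clock one, so a speedup computed from CPU time would hide exactly the waiting that the measurement is meant to expose.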
- 6/27/07, 11:00: I have been trying for the last day to run
my stuff on CoGrid, but every time I try, it refuses to compile. I'd
been using .f90 files, and it didn't create any errors, but it never
made the object files either, so it wouldn't compile. Michelle
suggested using the older (?) .f extension, so I did, but then it is
incompatible with Fortran90 even if I select the Fortran90 compiler.
So, basically CoGrid is just a pain, and it looks like I'm not going
to be able to run jobs on it using Fortran. Oh well....at least it's
not the only computer I have to work with.
- 6/26/07, 11:05: I have data from real sets of computers
now! There's still something weird with trying to run on 32-bit
machines; Michelle thinks it's something with it not compiling
correctly, but since the stuff I'm going to be working with is mostly
64-bit, we're kind of ignoring it for now. I spent the morning making
another set of scripts so that I can run jobs on CoGrid, since
obviously one master script that runs everything can't deal with
running jobs on CoGrid in the middle. This one basically takes the
original set of scripts and breaks them up: the first part sets up all
the files, then I run the actual experiments on CoGrid with a second
script, then I use the SQL parsing script to make a SQL table and a
Gnuplot graph with the results. Now to see if it will actually work
on CoGrid and not just on my machine!
- 6/25/07, 11:15: I've gotten the whole setup to work
correctly for the toy problem, but only on 64-bit machines. If I run on any 32-bit machines it hangs
(in the "main" node) indefinitely. I'm really not sure
what's going on with that...maybe I have some kind of infinite loop
(or more likely deadlock) that only occurs some of the time, and it
shows up when there are more machines being run? Maybe it's something
if I wind up breaking the bar up into more pieces than there are
points? But I'm keeping the number of MPI processes constant (right now
at 4) so I don't know how adding more MACHINES would break that, when
the number it's using is constant (and why would it break if I have 3
processors on 3 machines but not 4 processors on 2, or 1-2 processors
on 1?).
I also fixed the webpages so that they'll display correctly on
different window sizes, not just the one I'm using.
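One way to check the "more pieces than points" worry is to look at how the bar's points would be divided among processes. This is a minimal Python sketch that assumes a simple block decomposition; the entry doesn't say how the real code splits the bar, and `split_points` is a made-up helper.

```python
def split_points(n_points, n_procs):
    """Return a [start, stop) index range for each process rank.

    Points are handed out as evenly as possible; the first
    `n_points % n_procs` ranks get one extra point.
    """
    base, extra = divmod(n_points, n_procs)
    ranges = []
    start = 0
    for rank in range(n_procs):
        count = base + (1 if rank < extra else 0)
        ranges.append((start, start + count))
        start += count
    return ranges

print(split_points(10, 4))  # every rank gets some points
print(split_points(2, 4))   # more processes than points
```

With 2 points and 4 processes, the last two ranks get empty ranges such as `(2, 2)`. A rank with no points that still waits on a neighbor could hang, though whether that is what happened here is unknown.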
- 6/14/07, 17:00: Today I got everything for running the
tests automated (using a mess of shell scripts calling other shell
scripts calling python scripts calling Gnuplot, but hey, it works...),
meaning that to generate my graphs, all I should have to do is run the
main shell script with the desired input parameters, and a little
while later my set of graphs *should* pop out in the correct
subdirectories. I really want to see it work on a real run, not just
my two-machine tests I've been doing today!
- 6/13/07, 13:30: I finally got the parallel implementation
of my practice problem done (after writing about three different
implementations of it and eventually coming up [with Michelle's
help :)] with a solution that was way easier than what I had been trying)! I had a lot of trouble with the MPI_RECV
command not getting the correct ID of the process it was receiving
from, and neither Dave nor I could figure out why. It turns out that
there's something weird in Fortran that if you declare an array using
'Dimension', it seems to index it backwards from how MPI is expecting
it to be indexed. So it wasn't actually an MPI problem at
all...
- 6/12/07, 13:20: This morning I went to an interesting
presentation on Qt. This summer, there's a whole series of these
presentations on various useful tools, and this is the first one I'd
been to. I'd used Qt a little for a project last semester (nothing I
really had to write), but it sounds like there's a lot more to it than
I thought. I thought it was just a GUI-maker, sort of like Java's AWT
and Swing, but it seems almost like an entire language sort of based
on top of C++, that lets you do all kinds of things like threads,
sockets, etc. I'll have to mess around with it sometime.
- 6/11/07, 10:00: So, I get to start writing a parallel
version of the temperature-of-a-bar problem now...I'm going to be
using MPI to do it, but I really have no clue how to use it. It
doesn't look incredibly complicated on the surface, but it seems like
one of those things that has a lot of random little details that cause
you a lot of problems until you understand them.
- 6/8/07, 14:15: I finally got the printf-creating python
code running, once I figured out what I needed to print, and how to do
it in Fortran. For some reason, writing a program to make a print
statement is much harder than it seems like it should be, but it works
now. I also now have a working machine again, as this morning it
decided that I had no permissions to do anything at all. According to
the sysadmin, it was probably because of some confusion with switching
my machine for a new one, but it's all better now and likes me again
:).
- 6/7/07, 15:15: Yay! I got my *new* computer today (turns
out the one I had been assigned before was really somebody else's), so
no more swapping of monitors, computers, etc. I also have a shiny working
version of the temperature-of-a-bar problem, written in Fortran.
Oddly, a lot of the problems near the end were with trying to get the
output to work the way I wanted- Fortran is surprisingly awkward when
it comes to formatting things, I think because it goes back to the
days when you printed things on printers and read them in from punch
cards.
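For reference, a serial temperature-of-a-bar solver can be quite short. The sketch below is in Python rather than Fortran and assumes an explicit finite-difference update with both ends of the bar held at a fixed temperature. The actual scheme, bar length, and constants aren't given in these notes, so `step_bar` and every number here are illustrative only.

```python
def step_bar(temps, alpha, dt, dx):
    """Advance the bar one time step; the two end points stay fixed."""
    r = alpha * dt / (dx * dx)  # should stay <= 0.5 for this scheme to be stable
    new = temps[:]
    for i in range(1, len(temps) - 1):
        # Each interior point moves toward the average of its neighbors.
        new[i] = temps[i] + r * (temps[i - 1] - 2.0 * temps[i] + temps[i + 1])
    return new

# Hypothetical setup: hot ends (100), cold interior (0), 11 points.
bar = [100.0] + [0.0] * 9 + [100.0]
for _ in range(50):
    bar = step_bar(bar, alpha=1.0, dt=0.1, dx=1.0)

print(" ".join(f"{t:6.2f}" for t in bar))
```

Each interior point depends only on its two neighbors from the previous step, which is why a parallel version mainly has to exchange the points at the edges of each process's piece.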
- 6/6/07, 15:00: Today I'm pretty much just learning how all
the tools and stuff work, and figuring out how to actually write a
program in Fortran90. It seems to me like it's kind of an awkward
language (I miss my 'for' statements, and being able to declare
temporary variables close to where I need them), and there are a lot
of things about it that just seem like they made sense when the
language was originally developed, but really should have been changed
since then. The output formatting comes to mind- I'd really
like to be able to declare print statements without a newline, rather
than having to do complicated things with control characters to get it
back up to the previous line! But I suppose it makes more sense if
your output device is a printer, rather than a screen. Overall, it's
a pretty straightforward language, but I feel sort of limited in what
I can do with it.
- 6/5/07, 16:20: Today I actually have a computer and a desk
in the grad area, and keys to get INTO the grad area :). Pretty much
today was learning the sorts of tools we'll be working with, and sort
of the overall plan of how the summer's going to go. We had a group
meeting this morning to discuss the first reading "assignment" and
just kind of discuss who's doing what (and introduce me to the people
I'll be working with, since they've been here a week already). It
seems like it should be a really interesting summer, and I think I'll
learn a lot.
- 6/4/07, 16:00: My first day at work! Unfortunately, I don't
really have a desk right now, because while there are two computers
there (one of which doesn't seem to be a department one, and nobody
knows who it belongs to), neither is the one that's supposed to be
assigned to my desk, since the new one hasn't come in yet. I should
get it tomorrow, if I'm lucky. So, most of today was me learning
Fortran90 (Michelle gave me a couple of books to read on it) and
getting things like the NCAR logon stuff set up. Overall, there was a
lot of information all at once, but it seems like it should be an
interesting and fun summer.