10.23.08

Helping

Posted in General, Computer Science at 1:47 am by Nick

Moving my help sessions to Tuesday’s has made a big impact! I was happy to be approached for programming help almost immediately of my sitting down .

I went over my own time limit helping three students, so I’m curious how many will start showing up for more advanced help. It was wonderful to talk with some students about programming in a very non-academic setting. A setting where everyone could be perfectly honest and not feel bashful if they didn’t understand something simple. It is this candid setting I have wanted to promote for some time now. Students will learn better when the material becomes more relevant to them and they are able to be brutally honest about how they view it.

Interestingly enough, all three students were non-CS majors. I’m not entirely sure what that means overall. They are taking a class they don’t need -AND- they are seeking help outside of class hours. That alone is unusual. We also discussed all the concepts on paper and each student hand wrote their own notes, which was surprising to me as well. Overall I left feeling not only accomplished, but impressed. I hope these students return and more follow their example. I think it would help alleviate much of the frustration and anxiety many of them feel about the class.

That said, I was also pretty happy that we went through and wrote a program off the top of our heads. I didn’t write anything out ahead of time, so I was discovering how to solve the homework problem as they were. I think it was helpful for them to see how someone who has more experience thinks through the process and makes changes to the design as the program progressed. They expressed that they understand the concepts of what to do for the homework, but that initial leap into writing it out is whats so challenging. And I agree, it is hard to take idea’s out of your head and phrase them in a way a computer understands, which is why CS majors get the big bucks.

When all was done and I typed the program I came up with into my computer it also worked as intended (minus a few syntax errors), which is always a plus. I would hate to give them an example that I thought would work only to lead them astray. I think that next week, if the students aren’t pressed for time, I’ll try a more Socratic approach and see if they can come up with the ideas for each step.

03.06.08

Ailing Education

Posted in School, College, Computer Science, Thoughts at 5:50 pm by Nick

In recent times it has been very hard for me to attend class and get things done. This is due mainly to a troubling and complex family and living situation. Thus, I have decided to withdraw from two of my classes and work harder towards not marring up my academic record then I have already. Along with this, I am going to be moving during spring break.

However, I must say I am absolutely unimpressed (and disappointed) with some parts of my education. I know it may feel like an attack, but it should probably be heard. It is no news to the Computer Science program at UNI that enrollment is down and there are many people who switch majors. After talking to -many- CS majors (mostly freshman), I have decided it has little to do with the course content. If I could choose, there would be many things I would change based on my personal experience and feedback.

The first thing I would change would be our introductory courses. I know this is talked about and debated a lot. But what we have taught recently, Ada, C/C++, and Java, is a terrible choice. It is no reason our majors are running away! None of these languages are terribly user friendly for beginners. To exacerbate the problems, we thrust our students in without ever telling them what it will be used for, how it will be used, or why we use those particular languages. We don’t mention other alternatives. Why is this? Why do we not teach introductory classes with dynamic typing, like PHP or Ruby. I have never heard more complaints then about students who have no idea what the different between certain data types are. In C/C++ there are all kinds of exceptions, like using char’s as small integers, unsigned and signed, and C-style strings vs. C++ string data type. How are the students supposed to figure that all out? Ada is a language designed initially for embedded systems by the Department of Defense. That line alone is enough to scare freshman. There is nothing simple in these languages. So why do we teach them in our introductory courses? Our degree program already requires an extra Comp Sci general class. Why not save these industrial languages for them?

Students need to learn the simplest things first. These languages were not designed to be simple. Yes, if you know how to use them they are not as complicated as they first seem. But there is no reason to think a freshman in computer science, with no previous programming background, will understand the concept of passing the address of data to a function. I know several good programmers who still do not completely understand pointers. To teach the simplest thing possible, we need the language with the least amount of baggage. Something you can get using quickly, and without a lot of quirks. In C++, you need includes, a main function, knowledge of data types and operators before you can do anything. In dynamic typed interpreted language, you need basic operator knowledge, and that is all.

And yet I fear we cannot change our introductory languages because we have our two full professors teaching our introductory courses. Never mind the fact that the C/C++ students learn C, not C++. C++ comes later and is merely brushed on. I understand better then most the overlap between the languages, but there are some huge differences. I think students in the C++ section should be taught about C-style strings (that is, character arrays) but they should be taught about the more commonly used (and far more useful) STL string data type.

I strongly believe our faculty grew up when computer science was brand new. They all have fantastic and interesting stories. But this age is different. Our upcoming CS students are different. The times have changed. Computer science is no longer ‘new’, it is now ‘young’. The students will not spend all of their nights up late, huddled over their computer attempting to fix a broken program. That type of dedication is gone. It is the zeal of the new and unexplored. While there is plenty to explore in CS still, freshman will no longer be the ones doing it. We have a new computer science generation, and our program needs to realize this. We will continue to see low enrollment and a high drop rate if something doesn’t change. You cannot scare away all of the students by making them feel dumb and inferior because they cannot remember complicated syntax and expect your academic program to live.

Our introductory courses need to teach different things. Never, in any class I have taken at UNI, has a professor started out a class by mentioning why we should learn what we are learning. They have never mentioned it’s use or importance. In our introductory courses we jumped right in, delving into logical algebra and programming methods, but no one said what we needed to learn the algebra or methods for. I was lucky enough to know how to apply my skills already. Most students don’t. They have no drive because they are merely frustrated by everything they are learning. And who wants to walk up to a professor and admit they have no idea how to compute square roots using for loops? No one. And that is a big problem. I think we should promote a different way of learning. Students find it extremely hard to take program specifications and write a program from scratch. It is hard, and advanced programmers do it. Not new ones. We should develop programs for our freshman to modify, so they can see why code works and how their changes will reflect it. We should promote programming in small steps. If you write a program from scratch, it won’t work the first time. It never does. Maybe because it was late and you initialized your data wrong, or perhaps you thought through your packet filling the wrong way. Whatever the cause, if a student writes it from scratch and it doesn’t work, it is a moral failure. They aren’t used to it. They don’t understand that even the best programmers don’t write bug-free code.

Our students are not motivated. As I listen to classes, people, and friends talk about computer science, they are NEVER talking about what cool project they are working on. They don’t talk about things they are looking into, nor are they talking about things they aspire to make. They talk about how they failed to complete the homework because they had no idea where to start. They don’t have the time to work on something they are actually interested in because they spent four hours trying to fix a program that was probably never going to work the way they wrote it in the first place.

There are two things that bother me the most:

One, the fact that our major is long and requires so much math. Yes, I understand math is great. Yes, I understand we have a science degree and it will not be short because it requires training. But our degree provides no flexibility. If you want to reasonably complete it, you will work your ass off or take a more sane five year plan. I’m still unsure what is with the emphasis on math. I understand it -was- the foundation of our program. And I know it can help understanding, or expand thinking. But seriously, the last math class I technically completed was Geometry. My family life went to hell during Advanced Algebra and Trig, so I never completed the second semester, and I got a C in the first. I am not ashamed to admit this. It proves you do not -NEED- math to succeed in computer science. In fact, apart from simple arithmetic, math is largely unneeded unless you are working on math related problems. Discrete structures is a great class and very applicable. But you don’t need calculus to do it. I got an A- in it, merely missing the A by a percent or two. As a software engineer I’m pretty sure you will rarely deal with real math problems. So what is the math portion of our degree really for? Why not dump it and strongly encourage a mathematics minors? If you do the courses just right, our major is only one or two classes away from a math minor. But, of course, there is no real time to take those classes to get it if you want to graduate in four years.

And two, you never have time to work on projects that matter to you unless you are completely uninvolved on campus and jobless. Most (not all) professors are not flexible enough to allow freedom in their homework. Rarely can I think of times when I could pick homework that was interesting to me. I have gotten more then most, that is for sure. In our introductory classes, students may not know what interests them programming wise. But they certainly will not be interested in mundane examples. Not all students will want to write stuff on their own. But we should encourage them to pick homework that will relate to the topic at hand and help them do it. Everyone knows and can enjoy tic-tac-toe, and it is not hard to program, even for beginners. It demonstrates many programming techniques, such as input/ouput, array manipulation, or potentially AI or even OOD. Yet never has gaming been brought up in any of my classes, except for AI where game playing is part of the curriculum. In fact, in many of my classes, gaming is frowned upon. Never mind the fact that gaming is perhaps the easiest way to relate the material to the current students, and demonstrate any technique, be it lists, trees, arrays, classes, or OOD.

This all bugs me to the point of wanting to switch majors. I won’t, but it seems like a damn good idea. Economics is simple. They take the core of what you need and give it in eighteen hours. Then you choose eighteen more hours of education, putting what you want to learn in your hands. This gives you room in your degree to deal with bad classes, or explore things you are interested in. While I’m not saying computer science can be simplified to thirty six credit hours, I think the BS could provide far more flexibility then it does. Since we do not offer Software Engineering, nor anything else as a major, there is very little point in requiring students to pick an emphasis beyond Foundation courses. Let students take what they are interested in. They will look for jobs they are interested in, and then they will know more of what they need to know for their jobs. I have no real special interest in Information Science, it just seemed like the lesser of all of the evils that are our choices. I wanted to learn about databases, but none of the other courses seem interesting. I think the foundation courses are a good idea, but they should be expanded. Perhaps take a primary course from each section (Info Storage and Ret, Software Engineering, Networking) and put it in the foundations, then remove the emphasis requirements. This would require students to fulfill a requirement and pursue whatever path they would like without punishing them if they decide to change. Maybe students want experience in Real Time Embedded Systems, but have no interest in testing or project management. Currently, students will probably not pursue their desire to learn about real time embedded systems because they will be devoting their emphasis to something they only have one real interest in.

Note: This IS a first draft and WILL be revised.

11.30.07

Planting AA Trees

Posted in Programming, Computer Science, Computer Science III at 1:45 am by Nick

While researching Red-Black trees during my original coding homework sometime ago I stumbled upon the Arne Anderson (AA) tree. Looking back I don’t see why I didn’t push off the Red-Black trees and go with AA Trees. Who knows? Here is an analysis of my work with AA Trees:

AA Tree’s are very similar to Red-Black trees. They add one rule in addition to the traditional Red-Black tree rules, and that is that a Red node may only be a Right child. This eliminates a substantial amount of special cases, making this algorithm far easier to implement. Interestingly enough, AA trees perform as well as Red-Black trees. Since the AA algorithm is simpler, it takes less time to compute, but since it makes more rotations then Red-Black trees the performance evens out. Due to the nature of the algorithm (which shapes the tree), AA trees also tend to have better retrieval times then Red-Black trees. For those unfamiliar with the tree (which is probably most people), there are only two operations required for maintaining the tree (compared to the red-black tree’s five). A right rotation occurs when an insertion or deletion creates a red left child; this is known as a Skew. Split’s perform a (conditional) left rotation when a function creates two red links in a row.

My program implements a string to integer map using AA Trees. As an example, it counts word frequencies in a file. The string handling is rather rudimentary (converts to lower case and tosses out punctuation), so some of the words are kind of funky looking, but it works pretty well. I was surprised at how fast this program ran. I didn’t time it exactly, but it scanned a 11.7MB text file full of text in less than 30 seconds.

During my coding I hit several roadblocks. Coding the algorithm went well, but when I was programming my insertion function it all went screwy. I rewrote it three or four times only to realize that my initialization in my constructor was wrong (while cascading assignments for pointers I forgot the NULL at the end). While stepping through the program for an hour using GDB, I was rather clueless as to what was going on. Everything was wrong. None of the pointers were right, none had key data stored, and it crashed randomly. After fixing this I ran into an error with Skew running on my root pointer, which should never happen. Once I got passed these problems all remaining changes were pretty cosmetic.

I would love to package this up using templates and reuse this tree implementation elsewhere. It is pretty applicable, but again, the STL has a very good built-in map class.

About the code:
I did not flesh this code out much. The tree lacks a deletion function, and I show no examples of how to retrieve a word’s count using Lookup, though it is pretty straight forward. I use Lookup primarily as a wrapper for Insert (and for clarity), differentiating the names for their intended use within the code. There is also a large amount of pointer manipulation in this code, which I tried to document, but it may still be confusing to those new to pointers. Interestingly enough, while the algorithm itself contains a significant amount of references to red and black nodes (which this algorithm is based on), there is no reference to such in the code. The code assumes all Right links are red nodes, and instead everything is based off of the “level” of a node, which is the number of left links that must be taken to reach a NULL pointer.

It comes with a few text files as well. Data.txt is a smaller text file I used to test, being about 500KB or so, which resulted in out.txt. The file test.txt was a small test, and bigout.txt is the result of scanning the whole 11.7MB file.

Code -> http://ew.xidus.net/download/cs3/tree

Note: A wonderful summary, analysis, and comparison between some of the most common efficient data types can be found at http://www.upgrade-cepis.org/issues/2004/5/up5-5Mosaic.pdf. I recommend at least a glance at the diagrams. It is obvious that AA trees are outperformed on many levels, but they hold steady and beat data types on various challenges. Considering the simplicity of the algorithm and ease of implementation compared to the others, I’d say its far more useful.

11.26.07

Simply Linked

Posted in Programming, Computer Science, Computer Science III at 11:31 pm by Nick

It was interesting reading about a data type I had been using ever since I began coding. Little did I know that SMAUG and SWR made use of sequentially linked lists! Experimenting with lists has been fun and interesting. Once I got into programming it was hard to stop adding features. Lists are very easy to manipulate, and once the basics are in its easy to add things.

What I ended up with after this assignment was a class that replicated most features of STL’s built-in list class. It is templated to allow for any data type, and given a couple tests it seemed to work well. I purposefully avoided the use of any STL data types to drive the class, as that would be counter productive in the learning and showing of knowledge about lists.

I experienced many oddities with lists while I programed. After, again, fighting with template syntax, and I had an initial implementation I realized I was indeed missing a vital feature of a list data type: the ability to traverse it. When I gave the Element class the ability to cough up its pointers, it broke all list class functions that needed to set them since that data was now private. I am not totally happy with this implementation because it gives everyone the ability to modify a given elements pointers, not just its data. This is not good since anyone using those functions could potentially screw up the list, and things like that can be very, very hard to track down (nothing like spending hours stepping through a program to evaluate pointers at every step).

However, this was a very vital feature, so once I added it I found I was able to do quite a bit more with the list. An interesting side effect of my implementation is that you can add and remove things to the list without adversely affecting any current iteration (supposing some pointers are handled correctly). I demonstrate this slightly in the example programs provided by inserting an element into a list when a certain point is reached (and then that element is eventually processed like the rest).

When implementing the ability to traverse the list I found it made the code far more obscure. Lines that were already strange looking, like:

x->next->prev = x->prev;

became

(x->GetNext())->SetPrev(x->GetPrev());

And that, as far as readability of code, is very bad. However, its fairly simple to reason out, and with comments I feel no need to separate it into multiple lines of code. Also, each member function of the class is pretty specific and short, and they would normally appear multiple times within code, so each function is inline. I have no real way to tell if this is faster then outline code (is that even what its called?), but it seems reasonable that it should be inline.

Note: The code below is separated into different files for readability. It should be noted that all the code in list.h and list.cpp -NEED- to be in whatever file they will be used by. Therefore, all of the test files (main_int.cpp and main_string.cpp) would need this code copied and pasted into them to work. This has to do with the nature of how C++ uses templates and the code must be present in the source file it is used by. Due to the way I program this is awfully inconvenient, as I love my larger number of smaller file type projects. Also, the output from the two given test programs is provided in text format for examination.

Code -> http://ew.xidus.net/download/cs3/list

11.20.07

Returning to Computer Science: Hashing It Up

Posted in Programming, Computer Science, Computer Science III at 4:53 pm by Nick

While I am in the midst of Thanksgiving break from school, I find myself not taking a break at all. I’m working, and when not working I’m attempting to get school work done, as I have much to do (as always).

In any event, I decided to jump into hashing, and hash tables more specifically. Hash tables are something I’m very used to because when I started programming with SMAUG it had a hash table built into it (if you have a thousand instances of a mob it is an awful waste of space to allocate the same string a thousand times). So I learned the in’s and out’s required to use it, but like all of my knowledge at the time, I knew how to make it work without knowing -why- it worked. That has changed.

Data Structure Techniques by Thomas A. Standish had a lot of good information on hash tables. It taught me many things, even with my preconceived notions of what a hash table did and what its purpose was. I never even considered adding a key by a method other then division (modulo). His text on the multiplicative method is very interesting, though it seems a little too narrow to be useful in a large number of instances.

I initially had trouble testing my different hashing functions for one very simple reason: my computer. Since my last computer bit the dust (the eMachine lasted 5 years with no problems, I’d say thats pretty good quality), I’ve finally invested the money into a -real- computer (more on that here). Simply put, with the original test cases my independent, worst-case variable, the list structure, performed so fast I had to come up with another solution.

As I was in node mood to figure out and implement the use of a high performance timer (known as QueryPerformanceCounter on Windows), I merely used google and grabbed an easy implementation (doesn’t get easier then one structure!). After implementing that, I increased the number of items hashed by about 100 fold, bring most of the searches up to 15 million items. This gave a sufficient look at the differences between hashing functions and their performance compared to a list.

To compare hash functions, I went to Wikipedia (I think) and grabbed a recommended hash function, which happened to be Jenkin’s One-At-A-Time Hash. I then implemented two more very basic hashes. One was based off the SMAUG version, which hashes a string based on its length. The other is based on the first letter of the string. Oddly enough I named that CaseHash, when really it has nothing to do with the case of the first letter, but the letter itself.

The data was rather astounding. Jenkin’s hash almost always performed as well as or better then all other methods. Sometimes it would take a little longer (we are talking less then hundredths of a second longer though), but almost always was the top performer. Note in the final screen shot how it outperforms the other has functions, or performs as well. When doing a particularly hard lookup it would perform much, much better.

Provided via the link below is the source and a couple of screen shots (some have different output because I was busy tinkering). The first time is provided using the time() function, the second time is using the performance timer (with a precision of 15).

Code and screen shots -> http://ew.xidus.net/download/cs3/hash

Notes:
1) I intended for this hash table implementation to be far more flexible, but after fighting with templates for quite some time I decided to merely call it quits and hard code the strings in.
2) In screen shots 1,2, and 3 it appears to be sorting numbers. These are not numbers, but in fact strings that contain incrementing numbers. This was the product of the program I used to generate the data.
3) The programs I used to generate the data are also included in the link above. They are small and simple.
4) All code listed above is public domain. Use it as you like, or don’t use it. Your choice.

03.01.07

Quicksort w/ Stacks

Posted in Programming, ALPHA, Computer Science, Computer Science III at 8:57 pm by Nick

I finally got around to working on this project/assignment. It was a very interesting way to implement quicksort, though I’m inclined to think it isn’t the fastest in many cases. The code picks the first element as the partition point (which works decently for random numbers), but sometimes it is a very poor choice (for example, if the list is nearly sorted already).

The code: http://ew.xidus.net/download/cs3/2

—-

Here is the algorithm I followed (from Data Structure Techniques by Thomas A. Standish using C++ operators):

Quicksort sorts a sequence of n numbers in Table T into ascending order. The numbers in T are referred to as T[1], T[2], …, T[n]. The algorithm uses an auxiliary stack S, which is initialized to the empty stack. Algorithm 2.2 is used as a subroutine to partition subsequences of the numbers in T into suitable subproblems. We assume n is at least 2.

1. [Initialize] Set L = 1, R = n, and S = NULL. (L is the leftmost index of the interval (L, R) in Table T, and r is the rightmost index.)

2. [Partition interval.] Call Algorithm 2.2 with input (L, R) to partition interval (L,R) of T into two intervals: an interval (L1,R1) of numbers < T[L] and an interval (L2,R2) of numbers > T[L].

3. [Postpone larger subproblem] If (R1 - L1) > (R2 - L2), then perform S < = (L1, R1) (that is, push the interval onto the stack), set L = L2, set R = R2 and go to step 4. Otherwise, perform S <= (L2,R2), set L = L1, set R = R1 and go to step 4. (This puts the larger intervals on the stack to be done later and goes to step 4 to sort the smaller ones).

4. [Interval nontrivial?] If R > L then go to step 2. (Otherwise, R < = L so the interval (L,R) contains at most one number and need not be sorted. Thus, the stack can next be examined at step 5 to see if there are any more postponed subproblems to consider.)

5. [Done?] If S == NULL, the algorithm terminates. Otherwise, perform (L,R) <= S (that is, pop the top interval from the stack) and go back to step 4.

-- Algorithm 2.2 --
Using repeated exchanges, this algorithm partitions the distinct numbers in positions T[L], T[L+1], ..., T[R] into two subintervals: the interval (L1, R1) containing numbers less than T[L] and the interval (L2, R2) containing numbers greater than T[L]. The output consists of the ordered pairs of indices (L1, L2) and (L2, R2) for these two intervals. At termination, the numbers originally in T[L] lies between and separates the two subintervals. It is assumed initially that L < R.

1. [Initialize] Set i = L and set j = R.
2. [Decrease j] If T[j] > T[i], set j = j-1 and repeat this step. (This step finds the greatest j such that T[j] < = T[i].)
3. [Exchange or done?] If j > i, then T[i] < -> T[j] (that is, exchange T[i] and T[j]) and go to step 4. Otherwise, set (L1, R1) to (L, i-1), set (L2,R2) to (i+1, R), and terminate the algorithm.
4. [Increase i] If T[i] < T[j], set i = i -1 and repeat this step. (this step finds the least i such that T[i] >= T[j].)
5. [Exchange or done?] If j > i, then T[i] < -> T[j] (that is, exchange T[i] and T[j]) and go to step 2. Otherwise, set (L1, R1) to (L, j-1), set (L2,R2) to (j+1, R), and terminate the algorithm.

—–

The code works really well for smaller vectors. It easily sorts 10k random integers on my crappy machine in less then one second. This code also had a few quirks to it that I didn’t see initially. If you tried to sort sufficiently large vectors it would segfault and using GDB I was able to find that j and i would be corrupted in Partition. To fix this I added and statements so that j and i had to be in a realistic range (that is, j could be no less then L and i could be no greater then R).

Also, as the book says, if you have a nondistinct list of integers (that is, a list that contains two or more of the same number) then it would not work. It would find its way into a constant swapping of data. To fix this I changed the loops so that it would be >= for decrementing j and < = for incrementing i. Along with this I changed the swaps to be ( j > i && (v[j] != v[i])). Thus, if the numbers were equal it would not swap. These changes appear to allow the list to sort a nondistinct list correctly.

Lastly, I figured out the program was taking a crazy long time. It would take me 35 seconds to sort 10k elements. I added a statement to see how fast I was generating random numbers and I was surprised when it came back and showed me 35 seconds of the run time was generating random numbers. I examined my code to see how I could speed it up and I realized I had been adding elements to the front of a vector. This, of course, is a terrible method of doing things since it has to shift up all of the elements constantly. After I changed it to add to the end the time went drastically down, being able to generate 3 million random integers in 2 seconds.

All in all this code works really well and is quite fast. I would love to throw it up on Sidhe and see how fast it sorts compared to my previous qsorts, which could sort 5 million random integers in 14 seconds on Sidhe. On my home machine I ran a sort of 3 million integers (that took 2 seconds to generate) and it sorted in 2061 seconds (34.35 minutes). I ran it a second time and it sorted in 2283 seconds (38.05 minutes). I’m not sure what it chose for an index on either run, but those are not very good sort times in my opinion. I plan to test it using 499 as the partition point and then see how fast it runs. Of course my machine is several years old, so 34 minutes might not be a bad time. :P

Note: The code has two special test cases. If you enter 9 you will get the books sorting example. I did this for bug testing (using the comprehensive output) so I could see step by step what the program was doing. If you enter 11 it gives you the book example (of 9 numbers) and two of the same number at different points in the vector. This was used to test the code against nondistinct number lists.

02.17.07

Computer Science At Last, ISR Homework

Posted in Programming, School, College, ALPHA, Computer Science, Info Storage and Retrieval at 5:41 pm by Nick

At long last I have an update! A real update. I have been delving ever so slightly into the realm of Information Storage and Retrieval. This wasn’t my first time writing code to build a dictionary of terms, but other lessons have been interesting. Generating effective stop lists and weighting terms are all very practical techniques. And our textbook is free!

Assignment 1:

Read the computing abstracts file and generate a dictionary of terms. Convert terms to lower case remove extraneous endings and special characters. Sort the dictionary by frequency and turn in, along with the program, the first page of the sorted results.

The code, library, and related files:
http://ew.xidus.net/download/isr/1/

Notes:
I spent more time on this then it required simply to get a workable namespace going. All classes and related functions will go into the ISR library for use in later homework, and, at the end, I’ll have a workable Info Storage and Retrieval library with tested code.


Assignment 2:

Using the dictionary from the above, select high frequency and low frequency words and from these build a file that will be a stop list. Write a program that will process the abstracts (title and abstract). Your program will read each word in each title/abstract text and ignore words that are in the stop list. It will build a new dictionary vector giving the total number of times each word occurs and a document frequency vector giving the number of documents in which each term occurs. It will then calculate, for each word, the Inverse Document Frequency weight of the word. Hint: for each document, build the dictionary directly and generate a document-term matrix. Then after all documents have been processed, create the document-frequency vector by going through the document vectors and incrementing the D-F vector for each word found

The code, library, and related files:
http://ew.xidus.net/download/isr/2/

Notes:
This assignment looked daunting, but when I got started it was actually relatively easy. It was just a lot of work. I hit a few spots that proved troublesome, mainly due to the reading of the file (I should have written my own reading routine that processed each character instead of extracting from std::cin) and my being tired. Overall I’m fairly proud of this code as I didn’t know how to do any of it when I started. However, as with most code I look back on, I see things I definitely could improve. This code isn’t particularly efficient in my opinion, but overall it still runs fairly fast.

11.06.06

Programming Humor

Posted in Programming, Computer Science at 5:42 pm by Nick

C Programming Comic

Thanks Lisa!

10.23.06

Promising Outlook

Posted in General, Computer Science at 10:50 pm by Nick

O’kane sends things like this out occassionally, perhaps as a morale booster.

http://www.eweek.com/article2/0,1895,2034739,00.asp?kc=EWEWEMNL102306EP16B

Salaries for software developers are expected to rise 5.1 percent to the range of $60,250 to $94,750 in 2007. Base compensation for both Web developers and data warehouse managers is expected to increase 4.2 percent in 2007, with annual ranges of $54,750 to $81,500 and $85,500 to $113,500, respectively.

Many inelligent seniors and some college freshmen ask me about financial outlook for computer science majors. I would say its played a vital role in some of the people I have persuaded. If you are looking for a good, safe, work environment and very good pay for a four year degree, you can’t go wrong with comp. sci.

09.09.06

ALPHA - Fall ‘06 Project

Posted in Programming, General, ALPHA, Computer Science at 11:42 pm by Nick

It has been very difficult to determine what I wanted to do this year. I’ve neglected my computer so much recently I’m sure it should be something computer related.

On that note, two things come to mind. One, my long running interest (and chance to become even minorly famous among computer sciene nerds) , would be to attempt to speed up or break A*. This would be difficult, but I believe it can be done. Alternatively, I wanted to save such a thing for my undergraduate thesis.

Second, create a basic typing library for C++ to emulate strong typing conventions and subtypes like those used in Ada. This would certainly be helpful, and with a little work and dynamic approach could even become popular. 88% of programming errors come from integers with large domains that end up with unexpected values according to McCormick. Thus, it would be nice to find a way to limit this.

Using templates, STL, classes and inheritance I think I could create a good library to accomplish the said goals in C++. This would make for less errors and more compiler help (rather than runtime errors). Yay.

The only other option would be to create a very lightweight MySQL library in Ruby for use with Kaladea. And, for that matter, I should finish answerings Gammon’s questions on Kaladea, and finish reading the material posted in the BabbleMUD wiki. ::sigh:: So much to do, so little time. Further that note, I need to finish reading the script for the play and finding the sounds, and the robotics team starts up this tuesday.

There’s a thought. Maybe the robots programming can be my project? Chances are it will be like last year: a broad collection of programs of varying degrees of difficulty with the only point being to try new things.

We shall see.

« Previous entries ·