"My Comments" algorithm to be perfected November 28, 2005 9:27 AM   Subscribe

I suspect the "My Comments" algorithm needs a little tweaking? [MI]
posted by nobody to Feature Requests at 9:27 AM (21 comments total)

My understanding was that it's meant to show threads in which you have previously commented in which other people have made additional comments within the past X number of days. It looks like the threshold is 6 or 7 days.

But it looks like the script might be written such that your own comment must fall within the previous 7 days, and not just others' comments that come after yours.

I guess this might be intentional, but then it seems the less helpful of the two possibilities (and means "My Comments" and "My Posts" don't really function in a parallel fashion).

(Examining further, while the header buttons and HTML titles say "My Posts" and "My Comments," the pages themselves say at the top "Recent Activity on my posts" and "My Recent Comments," meaning that I guess I ought to change this post's category from "bug report" to "feature request.")
posted by nobody at 9:28 AM on November 28, 2005


(I bothered posting this because the current "My Comments" page lulled me into thinking a thread I had participated in was dormant, when in fact it was still bustling along.)
posted by nobody at 9:32 AM on November 28, 2005


Huh. I'd been operating under the same assumption. Just because I shut up in a thread doesn't mean I don't want to see how it plays out.
posted by cortex at 9:46 AM on November 28, 2005


The title on the pages say "my recent comments" and "recent activity on my posts" but I shortened the titles in the nav when "my recent comments" was deemed too long in the redesign threads.

It's an issue of limited resources. To generate the page, I have to check for a user's comment in one of three comment database tables, and it needs a time cutoff to save resources. The first few tests that went back through the entire 1million+ comments looking for a user's comments took 30 seconds to a minute to complete. To save resources, I just look at the last week for comments you've made.

Seeing how mefi and metatalk threads are active for 30 days, let me up it to 30 to see if that drags it down much further.
posted by mathowie (staff) at 10:12 AM on November 28, 2005


mathowie writes "It's an issue of limited resources. To generate the page, I have to check for a user's comment in one of three comment database tables, and it needs a time cutoff to save resources. The first few tests that went back through the entire 1million+ comments looking for a user's comments took 30 seconds to a minute to complete."


Matt, what are you using for a database? There's no good reason for this to require a table scan of the entire 1 million+ records. (There's also limited utility in having three tables rather than one, but that's more understandable.)

If you indexed the table on user id, your retrieval time would drop by a couple orders of magnitude. Only slightly more space costly, index on user id and comment id (I'm presuming comment ids are assigned sequentially); then pulling out the last N comment ids would be a sequential read of that index.

And note, the index is only minimally space costly (two ints), and the time cost is on the update, which is (relatively) infrequent. It's also a minimum time cost, because no lookup is required -- it's exactly the same data (user id and comment id) you're already inserting.
posted by orthogonality at 10:39 AM on November 28, 2005


So I tried it going back 30 days, and it was roughly 2-3x the processing time to grab my comments, then the server process died after a few minutes, so I'm thinking maybe it's a bit too much to keep it at 30 days, so I put it back to 7 days.

The assumption here is that threads you take part is rarely go on for more than 7 days.
posted by mathowie (staff) at 10:41 AM on November 28, 2005


Everything is indexed on user_id already. I only searched every comment ever as an early test to see how long it would take, that's not what the script was designed to do, it always just plumbed the last 7 days for comments. And I agree on the multiple comment tables. If I would have known there'd be more sections of the site I would have just made a single master comment table.
posted by mathowie (staff) at 10:43 AM on November 28, 2005


mathowie writes "Everything is indexed on user_id already. I only searched every comment ever as an early test to see how long it would take, that's not what the script was designed to do, it always just plumbed the last 7 days for comments. "

Perhaps the problem is in how you're limiting it to seven days (you may be forcing the db to compare the dates, which would force table scans and reads of the date). You want everything you're doing in the query to be available in an index; in particular "last 10 comments for user" is easier than "comments in the last week for user". But if doing it by week is a requirement, try adding the date to an index.

In any case, do a showplan for the actual query, and see what it's doing. I don't know what database you're using, but most have a "showplan" or "explain" setting.
posted by orthogonality at 10:56 AM on November 28, 2005


Oh, well.

For me (who comments much less prolifically), it looked perfect during the short window of 30 day querying (and didn't seem like a much longer load time). It was still only showing threads with comments made in the past week (which is good), but was including threads in which my last comment was more than a week ago.

Would you consider making it a user-defined option?

(I guess this will be a moot point once the "watch this thread" feature is implemented.)
posted by nobody at 10:59 AM on November 28, 2005


I had been wanting to know which way that page worked for a while. I wanted to know if it was 7 days since my comment or 7 days since a comment at all. Thanks for bringing this up, nobody.

I second the notion that it's less useful this way.
posted by raedyn at 11:19 AM on November 28, 2005


"last 10 comments for user" is easier than "comments in the last week for user"

What if the user is new and has only left 4 comments to date?
posted by mathowie (staff) at 11:21 AM on November 28, 2005


Ok, I did some new indexes based on MS SQL's query analyzer and it seemed to speed up the 30 day query about 25%. So I'll leave it at thirty days for a while today to see how that goes.

orthogonality, I'm running microsoft sql server 2000. If you have any experience with it, I'd be happy to talk about it offline. I could use some dba help.
posted by mathowie (staff) at 11:24 AM on November 28, 2005


Thanks mathowie.
posted by raedyn at 11:51 AM on November 28, 2005


Have you thought about denormalizing or adding reporting tables? I know there are a few DBAs here that would love to help.
posted by blue_beetle at 12:12 PM on November 28, 2005


mathowie writes "Ok, I did some new indexes based on MS SQL's query analyzer and it seemed to speed up the 30 day query about 25%. So I'll leave it at thirty days for a while today to see how that goes.

"orthogonality, I'm running Microsoft sql server 2000. If you have any experience with it, I'd be happy to talk about it offline. I could use some dba help."


Yeah, I have some experience with Microsoft sql server; its internal reminds me a lot of Sybase's Transact SQL, but it does some damned odd things.*

If you want to talk offline, Jessamyn knows my email (the email on my profile no longer works).

* Just for database geeks: Consider the following stored procedure (for convenience, it's written in Sybase's T-SQL)

create proc foo( @firstname varchar(255), @lastname varchar(255) ) as
select * from person a
where
( @firstname is null or a.firstname like @firstname )
and
( @lastname is null or a.lastname like @lastname )

Now, this works quite efficiently in Sybase, and effectively gives you an overloaded function: pass either (or both) of firstname and lastname, and only matching records are returned; leave either (or both) null, and the result set isn't restricted by that attribute.

When either parameter is null, Sybase correctly treats the "or" as short-circuited, and the results are returned without comparison on that attribute.

For the same function MS SQL server doesn't short-circuit the "or", and apparently doesn't even cache the result of the expression "@lastname is null". (Note, we're testing the parameter against null, not the row value; so this test only needs to be done once.)

So MS SQL Server does a table scan even if both parameters are null, and does all comparisons for each row.

To write a function with the same functionality and efficiency under MS SQL Server, you need to branch with "if"s to separate select statements -- which in turn causes a combinatorial explosion, as each additional parameter doubles the number of individual branches and selects needed. In practice, this is a mess, so you just don't do it.

posted by orthogonality at 4:31 PM on November 28, 2005


Thank heavens I can keep getting updates on that portobello mushroom thread now.
posted by Wolfdog at 6:06 PM on November 28, 2005


"Yeah, I have some experience with Microsoft sql server; its internal reminds me a lot of Sybase's Transact SQL"

That's because it is Sybase.
posted by Ethereal Bligh at 6:38 PM on November 28, 2005


Ethereal Bligh writes "That's because it is Sybase."

Yeah, I remember hearing that. What I don't understand is why the query plans can be so different. What Sybase version is MS SQL based on?
posted by orthogonality at 6:41 PM on November 28, 2005


I don't know offhand. It would be cheating to Google.
posted by Ethereal Bligh at 6:50 PM on November 28, 2005




I now officially love My Comments, without it I never would have known that Orthogonality had tried to snark me. And missed.
posted by The Monkey at 7:51 PM on November 29, 2005


« Older NYC meetup dec 2005   |   ASL? Newer »

You are not logged in, either login or create an account to post comments