
Re: Combat Revamp Discussion

Posted: Fri Aug 03, 2012 4:41 pm
by Jabantiz
This post is mainly me putting ideas out for myself. If you agree, disagree, or have any opinion at all, let me know; I'm trying to squash this damn bug.

When we get a zone deadlock, the client can run around the zone with no problem, but any commands the client tries don't get processed. When clients in other zones send commands that can affect the first client (mainly chat; I hadn't thought of trying others, like /invite, until now), the first client still receives those messages. This leads me to believe that Client::HandlePacket() is never called. It is only ever called from Client::Process(), which is called from the zone server's main thread. If Client::Process() fails on the zone thread, the client should get disconnected, or if an exception is thrown we should get the "Exception caught when in ZoneServer::ClientProcess()" message (which we don't).

I think the issue is here somewhere, not sure what though. I am still unable to duplicate the issue, even with the DB from server pack 1.3. Putting logging in this function will be spammy as hell also; we could try moving it to its own thread, but I'm not sure that would have much of a benefit.

Re: Combat Revamp Discussion

Posted: Fri Aug 03, 2012 5:40 pm
by John Adams
Jabantiz wrote: putting logging in this function will be spammy as hell also, we could try moving it to its own thread but not sure that would have much of a benefit.
When I first added TRACE logging, I had it at every function enter/exit for a few modules. Due to the *Process() functions constantly spinning, that got to be a nightmare in about 5 seconds.

Logs were writing so fast, they were totally overwriting the middle of each other... which is why I asked Scatman to queue the log output. He said he would, but hasn't gotten to it yet.

Second suggestion, put in your spammy logging... leave them DISABLED by default, and use log_config.xml to turn them on to write ONLY to the file (Logs = 1) and you might get something out of it.

Eventually I will put all TRACE logging back in but not until I have queued logging.

Re: Combat Revamp Discussion

Posted: Fri Aug 03, 2012 9:11 pm
by Jabantiz
By moving ZoneServer::ClientProcess() to its own thread it seems to have stopped the zone from freezing/deadlocking. I have been on EQ2TC for 3 hours now doing random crap in the same zone and it is all still running good. The issues I noticed while fighting with /invul 1 seem to be gone as well.

Edit: 6 hours and everything is still working; hopefully this fixed the issue.

Re: Combat Revamp Discussion

Posted: Sun Aug 05, 2012 1:37 pm
by John Adams
Sitting in EQ2TC now, barely 20 mins, 6 players, 1 in Ruins, 5 in QC, and QC is locked up. Ruins is still functioning.

My setup is:
5 players in QC
2 of them are in a group together (Guardian and Templar)
3 more just standing around QC
1 remaining player (Ranger) is in Ruins

All I did was rebuild the Templar's hotbar with spells she has, then tried casting Aegolism on myself, to no effect (E Spell: SpellProcess::ProcessSpell Unable to find any spell targets for spell 'Aegolism'.). In trying to research why this spell wasn't landing even on myself, I changed some spell parameters (target_type) and did 1 /reload spells (only 1). I could not get Aegolism to land on self, so I added Guardian to Templar's group. Now the group has 2 members (only).

Note: changing spells info was moot, since the spell data is on Dev, and I am on EQ2TC - oops!

I cast Aegolism again, this time target_type = Group AE, but still got the same message. No targets. And that's where I went to code, and left chars standing in zone.

While I was walking through GetSpellTargets(), commenting code so I understood what each convoluted "if" does, I thought I'd move my toons to Ruins to stand with the orcs fighting. This is when I noticed /zone did nothing on all my toons (except Ranger, who is in Ruins). /who all no longer works for QC chars, but does for the Ruins char. Zero feedback from World that anything is wrong.

Only 1 client was doing anything, btw. The Templar, casting spells. Everyone else was just standing around. Hope this info helps. I will attempt to reproduce this same deadlock.


Edit: I was compiled in Release x64 btw.

Re: Combat Revamp Discussion

Posted: Sun Aug 05, 2012 2:34 pm
by John Adams
And of course, now that I am in Debug_x64 mode, I cannot reproduce the issue. Sigh.

Re: Combat Revamp Discussion

Posted: Fri Aug 10, 2012 11:22 am
by John Adams
Okay, let's talk stability. I am doing back-flips over how much better Combat functions now. Everything related to combat timers and loops seems to be 100x better than it was a month ago, so I think whatever was done was definitely a plus in that respect. The reason I am a BIG FAN of removing Mutex stuff for vector<> or whatever is that a year+ ago, Scatman already suggested we do this; we just never had time to... so I think Jabantiz is on the right track.

However, the zone exceptions and World crashes take some of the "shine" off these positive efforts. I just need an assessment on a) whether these new crashes are even related to Combat at all (we changed a few things), b) whether it is possible to regain stability quickly, or c) whether to put Mutex back in and fix it, hoping we understand it better now.

Thoughts?

Re: Combat Revamp Discussion

Posted: Fri Aug 10, 2012 2:43 pm
by Jabantiz
The current zone exception that has me stumped is related to the LD code, and that is the only one I know of. There is the deadlock issue (what you posted above), but that was around before and doesn't happen nearly as often now; it seems random and I haven't been able to track it down. What other zone exceptions are happening that I have forgotten or am not aware of?

Re: Combat Revamp Discussion

Posted: Fri Aug 10, 2012 3:24 pm
by John Adams
Well, because the code does not offer any suggestions about what threw the error, I cannot say. I know LD sometimes has issues (inconsistently), and when I /zone or /camp and get a crash, all I know is that is what I did to make it happen. I've had issues with multiple clients doing things at once, to each other or to NPCs, kinda all over the map. Mostly they have been discussed. Unless we take out the Try/Catch and attempt to trace the crashes, I don't see how we'll ever know what is really causing it.
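A middle ground between removing the Try/Catch and keeping an uninformative "Exception caught" message might be to wrap each step in its own handler that names the step. The following is a sketch only: `RunStage` is a hypothetical helper, not existing code. It reports C++ exceptions; whether hardware faults such as access violations ever reach a C++ handler depends on the compiler's exception settings, so it may not cover every crash discussed here.

```cpp
#include <exception>
#include <stdexcept>
#include <string>

// Hypothetical helper: runs one named step and returns an empty string on
// success, or a message identifying the step that threw.
template <typename Fn>
std::string RunStage(const char* stage, Fn fn) {
    try {
        fn();
        return std::string();
    } catch (const std::exception& e) {
        // Standard exceptions carry a description we can log.
        return std::string("Exception in stage '") + stage + "': " + e.what();
    } catch (...) {
        // Anything else still tells us which stage was running.
        return std::string("Unknown exception in stage '") + stage + "'";
    }
}
```

Wrapping, for example, the packet handling and the LD removal separately would at least narrow a zone exception down to one of those steps.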

I can try this on EQ2TC, but the player population has all but withered away. Up to us to crash it, I suppose.

Re: Combat Revamp Discussion

Posted: Sat Aug 18, 2012 4:18 pm
by John Adams
Chasing down Alfa's lag report, I can definitely see what he's talking about. Here's the process CPU utilization with my character just standing in Timorous Deep, with most logging disabled -
[Attachment: cpu.jpg]
I cannot say if this is the new changes to Combat, or something else that crept in as we're adding things. Since Jabantiz has run out of time to spend on this right now, I will take a stab at it. First, I'm going to revert to before Jab did any work, just to see if the MutexList/Map code he removed behaved the same (I'm almost sure it did). If so, then it's definitely not the new combat system changes, and I'll have to rely on Scatman to help me troubleshoot.

This is pretty much a show-stopper for us. There is no sense adding 1 more feature to this mess we have already until these problems are worked out. Hopefully, soon. I'd like to fix this and release it as 0.7.1 and start our next cycle.


Edit: Hah, well I guess that answers that question. Here's the process CPU utilization running the exact same test in "Release" (non debug) mode -
[Attachment: cpu-release.jpg]
I knew DEBUGging was slower, but man, I didn't think it was THAT horribly slow. Unfortunately, I cannot debug crashes in Release, so you're going to have to suffer with Lag for now ;)

Re: Combat Revamp Discussion

Posted: Sat Aug 25, 2012 9:17 am
by John Adams
Well, here's something interesting. Another crash, this time in CombatProcess().

Stack:

>	EQ2World__Debug.exe!ZoneServer::CombatProcess()  Line 884 + 0x18 bytes	C++
 	EQ2World__Debug.exe!CombatLoop(void * tmp)  Line 4309	C++
 	EQ2World__Debug.exe!_callthreadstart()  Line 259 + 0xf bytes	C
 	EQ2World__Debug.exe!_threadstart(void * ptd)  Line 243	C
 	kernel32.dll!77e6482f() 	
 	[Frames below may be incorrect and/or missing, no symbols loaded for kernel32.dll]	
Code:

bool ZoneServer::CombatProcess() {
	bool ret = true;
	MutexList<Spawn*>::iterator itr = spawn_list.begin();
	while (itr.Next()) {
		if (itr->value->IsEntity())  // <=== crash reported here
			if (!combat->Process((Entity*)itr->value)) {
				ret = false;
				break;
			}
	}
	return ret;
}
The interesting info is in the last few lines of the World log. Note the number of quests (903850984), which might be a bad printf value, and the fact that the client going LD is "Null", which is likely our crash this time.
10:49:32 D LUA: Quest: Running Off the Grobin Scouts, function: QuestComplete
10:49:32 D LUA: Done!
10:49:32 D Client: Send Quest Journal...
10:49:32 D Client: Send Quest Journal...
10:49:41 E Command: Error in COMMAND_ACCEPT_REWARD. No pending quest or collection reward was found (unknown=0).
10:50:00 D Client: Found 903850984 pending quests for char_id: 135
10:50:00 D LUA: Quest: Grobin Trouble at the Pond, function: Accepted
10:50:00 D LUA: Done!
10:50:00 D Client: Send Quest Journal...
10:50:01 D Client: Found 903850984 active quests for char_id: 178
10:50:25 D LUA: Found LUA Spell Script: 'Spells/Scout/Tracking.lua'
10:51:44 D Zone: Client is disconnecting in ZoneServer::ClientProcess (camping = false)
10:51:44 D Zone: Sending login equipment appearance updates...
10:51:44 D Zone: Calling clients.Remove(client)...
10:51:44 D Zone: Removing client 'Null' (178) due to LD/Exit...
10:51:44 I Zone: Scheduling client 'Null' for removal.

10:51:44 D Player: Toggling Character OFFLINE!
10:51:44 D CClient: Client Disconnect...
10:51:44 D Zone: Starting zone shutdown timers...
10:51:58 D World: Removing connection...

Soon as my house guests leave (Mon) I will start my attempt to root out this problem by reverting combat system or LD code changes.
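One possible reading of the CombatProcess() crash above (an assumption; nothing in this thread confirms the cause) is that a spawn was removed or freed by another thread while CombatLoop was still iterating. A common defensive pattern is to iterate over a snapshot of shared pointers, so removal elsewhere cannot invalidate what the loop is holding. In the sketch below, `Spawn` is a stand-in struct and `SpawnRegistry` and `CombatPass` are hypothetical names; the real code uses MutexList<Spawn*>.

```cpp
#include <algorithm>
#include <memory>
#include <mutex>
#include <utility>
#include <vector>

// Stand-in for the real Spawn class; only what the sketch needs.
struct Spawn {
    explicit Spawn(bool is_entity) : is_entity_(is_entity) {}
    bool IsEntity() const { return is_entity_; }
    bool is_entity_;
};

typedef std::shared_ptr<Spawn> SpawnPtr;

// Hypothetical container: every access takes the mutex, and readers work
// on a copy so they never hold the lock while processing.
class SpawnRegistry {
public:
    void Add(SpawnPtr spawn) {
        std::lock_guard<std::mutex> lock(mutex_);
        spawns_.push_back(std::move(spawn));
    }
    void Remove(SpawnPtr spawn) {
        std::lock_guard<std::mutex> lock(mutex_);
        spawns_.erase(std::remove(spawns_.begin(), spawns_.end(), spawn),
                      spawns_.end());
    }
    std::vector<SpawnPtr> Snapshot() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return spawns_;  // copies the shared_ptrs, keeping the spawns alive
    }

private:
    mutable std::mutex mutex_;
    std::vector<SpawnPtr> spawns_;
};

// Mirrors the shape of CombatProcess(): skips null and non-entity spawns,
// and stops with false as soon as process() returns false.
template <typename ProcessFn>
bool CombatPass(const SpawnRegistry& registry, ProcessFn process) {
    for (const SpawnPtr& spawn : registry.Snapshot()) {
        if (!spawn || !spawn->IsEntity()) continue;
        if (!process(*spawn)) return false;
    }
    return true;
}
```

The trade-off is a vector copy per pass and spawns that may live one pass longer than their removal, in exchange for never dereferencing a pointer that another thread has already deleted.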

Re: Combat Revamp Discussion

Posted: Wed Aug 29, 2012 5:55 pm
by Jabantiz
I had an idea that was easy enough to implement that I could do it and run basic tests with my extremely limited time right now. From the tests that I did, it worked a lot better than I had hoped; my tests were very limited though. What I did was move client process back to the main zone thread (I forgot the reason why I moved it to its own thread, and it seemed to just cause new problems) and merge the combat thread with the spawn thread.

This also may have fixed the LD zone crash; I was not able to reproduce that crash. Right now ZoneServer has 2 threads running all the time: the main zone thread and the spawn thread. The spawn thread may be able to be improved, though, as it currently calls several functions that each loop through the entire spawn list. It would probably be better to loop through the spawn list once and do everything that is needed instead of looping through it 5+ times. I would do this myself, but I don't have the time to go over all the functions in ZoneServer::SpawnProcess().
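The single-pass idea for the spawn thread can be sketched as below: instead of each function walking the whole list, the per-spawn pieces are registered as steps and applied together in one walk. Everything here is illustrative; `DemoSpawn`, `SpawnStep`, and `ProcessSpawnsOnce` are invented for the example and do not correspond to the actual functions inside ZoneServer::SpawnProcess().

```cpp
#include <cstddef>
#include <functional>
#include <vector>

// Stand-in spawn type for the example.
struct DemoSpawn {
    int id;
    int ticks;
};

// One unit of per-spawn work (movement, regen, combat, ...).
typedef std::function<void(DemoSpawn&)> SpawnStep;

// Walks the spawn list exactly once and runs every step on each spawn.
// Returns the number of spawns visited.
std::size_t ProcessSpawnsOnce(std::vector<DemoSpawn>& spawns,
                              const std::vector<SpawnStep>& steps) {
    std::size_t visited = 0;
    for (DemoSpawn& spawn : spawns) {
        for (const SpawnStep& step : steps) {
            step(spawn);
        }
        ++visited;
    }
    return visited;
}
```

One thing to watch: this changes the order of work from "step A for all spawns, then step B for all spawns" to "all steps for spawn 1, then all steps for spawn 2", which only matters if a step relies on an earlier step having finished for every spawn.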

I am committing this code now, but I would really appreciate it if someone could do more extensive tests on it.

Re: Combat Revamp Discussion

Posted: Fri Aug 31, 2012 8:19 am
by John Adams
Good work. Sounds positive. I will put this on EQ2TC right now, let it run and we'll see the results. You can ignore your PM if you think this resolves it.

Re: Combat Revamp Discussion

Posted: Fri Aug 31, 2012 7:27 pm
by Jabantiz
I did some tests on EQ2TC, mostly LD related, but I did do some combat. I did get the duplicate packet message, but only when I logged another char into a different zone (logged into Ruins while the other char was sitting in Ant). All my LD attempts, single client and multi client, have not caused a zone exception. As far as combat goes, there was some lag but nothing too terribly bad. I did notice spell icon shading was reversed for some reason though (shaded when ready to cast, unshaded when not ready); I didn't notice that on my server but will check in a bit. Overall I think this last change was a step in the right direction.

Re: Combat Revamp Discussion

Posted: Sat Sep 01, 2012 5:05 pm
by Jabantiz
Ignore the previous post, EQ2TC wasn't running the latest code when I tested.

My new test was 3 clients: 2 sitting in Antonica, 1 in Queen's Colony. I did combat with the one in Queen's Colony for a while before I LD'd one of the clients in Antonica, then continued killing stuff while I waited for the LD timer to expire; there was no zone crash. I continued to kill stuff for a little while, then I LD'd both clients and waited; again no zone crashes. Combat was smooth, no lag whatsoever. However, the icon shading was wacky at best. I don't have that issue on my server, but my server runs in debug; I will test it in release in a little bit. There were also no Duplicate packet messages this time.

Overall this may be the best solution so far, it will still require some testing though.

EDIT: Getting strange behavior with shaded icons on a 64bit release compile, not as bad as on EQ2TC though. Didn't notice this in debug and my compiler crashes when I try to do 32 bit release. This could either be an issue with combat (spells not getting the right status) or with my passive spells changes. Ran out of time today to track it down though.

Re: Combat Revamp Discussion

Posted: Sun Sep 02, 2012 12:13 am
by John Adams
EQ2TC should have had the latest code as of 8.31.2012 (note the date on the Debug compile). I run EQ2TC in Debug, else I cannot trap the crashes. I see it is now in Release, which - as far as I know - will not show the Dupe Packet or Future Packet stuff; pretty sure that is only if the code is Debug, not Release. At least that's how I've seen it function the last 5 years. I could be wrong.

I will put Debug back up on EQ2TC, and it is an x86 machine as well. Dev is x64, but Linux. Not sure if the chipset should have anything to do with odd spell shaders, could just be a coincidence. Soon as a few other players get online, we can see if the zone exceptions happen again. Btw, there was no "crash" really, just the zone exceptions - and the zone never shut down.

I shouldn't say "never", because you engineers will cling to that ;) I do mean, most of the exceptions I've seen lately have simply been Exception caught in so and so zone. When I post a call stack, that is a world crash - which has been more frequent lately than in the last 2 years. But, we'll get it figured out. Reverting the code didn't seem to help one bit... so I am completely confused.